A systematic approach to normalization in probabilistic models
Abstract
Every information retrieval (IR) model embeds in its scoring function a form of term frequency (TF) quantification. The contribution of the term frequency is determined by the properties of the function of the chosen TF quantification, and by its TF normalization. The first defines how independent the occurrences of multiple terms are, while the second acts on mitigating the a priori probability of having a high term frequency in a document (estimation usually based on the document length). New test collections, coming from different domains (e.g. medical, legal), give evidence that not only document length, but in addition, verboseness of documents should be explicitly considered. Therefore we propose and investigate a systematic combination of document verboseness and length. To theoretically justify the combination, we show the duality between document verboseness and length. In addition, we investigate the duality between verboseness and other components of IR models. We test these new TF normalizations on four suitable test collections. We do this on a well defined spectrum of TF quantifications. Finally, based on the theoretical and experimental observations, we show how the two components of this new normalization, document verboseness and length, interact with each other. Our experiments demonstrate that the new models never underperform existing models, while sometimes introducing statistically significantly better results, at no additional computational cost.
Keywords
Verboseness hypothesis, TF normalization, Smoothing
1 Introduction
The development of retrieval models is one of the key aspects of research in information retrieval (IR). The IR models arise from experimental observations about the use of the language, predominantly on collections of documents primarily composed of news corpora. Today, with the almost total digitization of most text produced, it is clear that the textual documents are not just news and that different collections require different approaches (Hanbury and Lupu 2013). Consequently, the field has been driven to deal with different kinds of information types, demonstrated by the creation of new and more domain specific initiatives in the main IR evaluation campaigns: TREC, NTCIR, CLEF, and FIRE. Now, thanks to the observations made in the context of these evaluation campaigns, we are able to revisit some of the original assumptions and extend the models to integrate other collection statistics that reflect the different use of the language in different domains.
Every IR model boils down to a scoring function in which we can distinguish a component that increases with the number of occurrences of a term in a document (a term frequency component, \({\text {TF}}\)) and a component that decreases with the commonality of a term (an inverse document frequency component, \({\text {IDF}}\)). In this paper we focus on the \({\text {TF}}\) component. Its normalization, first introduced by Robertson et al. (1994) for BM25, and then generalized by Singhal et al. (1996) for a generic model, consists in adjusting the within-document term frequency (\(\textit{tf}_d\)) based on the ratio between the document length (\(l_d\)) and its expectation (\(\mathrm {E}_{\mathcal {D}}[l_d]\)), called pivoted document length normalization. The work of Singhal et al. is motivated by the experimental observation that the length pattern of the retrieved documents should match the pattern of the relevant documents. Robertson et al. justify this normalization, later declared as ‘soft’ for the mitigation effect provided by the division by the mean, by introducing two contrasting hypotheses (Robertson and Zaragoza 2009), named verboseness and multitopicality: (a) the verboseness hypothesis states that some authors need more words to explain something that could have been explained with fewer; (b) the multitopicality hypothesis states that the reason why more words are required is because the author has covered more ground. While the first hypothesis suggests a document should be normalized by its length, the second suggests the contrary.
Recently, Lipani et al. (2015) brought this discussion back to the attention of the IR community, pointing out that another collection statistic could be embedded in the \({\text {TF}}\) normalization of BM25. This new statistic measures a kind of verboseness, the repetitiveness of terms in a document, and leads to better performance than the standard BM25.
In this paper we address this new observation from the perspective of the established models, and provide a new, general theory. Before doing that, a few general observations are in order.
\(\textit{tf}_d\)  within-document term frequency; frequent is good
\(P_D(t|c)\)  document-based term probability (aka \({\text {IDF}}(t,c) = -\log (P_D(t|c))\)); rare is good
\(P(t|c)\)  occurrence-based term probability (LM mixture)
\(l_d\)  document length; to promote short documents
1.1 Notation
\(\mathcal {T}\)  set of terms in the collection 
\(\mathcal {D}\)  set of documents in the collection 
t  a term \(t\in \mathcal {T}\) 
d  a document \(d\in \mathcal {D}\) 
\(|\mathcal {T}|\)  number of terms
\(|\mathcal {D}|\)  number of documents
\(l_c\)  length of collection (number of term occurrences) 
\(l_t\)  number of occurrences of the term t in the collection, here also called term length (aka collection frequency) 
\(\mathcal {D}_t\)  set of documents where t occurs 
\(\mathcal {T}_d\)  set of terms in d 
\(|\mathcal {D}_t|\)  number of documents where t occurs (aka document frequency, \({\text {df}}(t)\))
\(|\mathcal {T}_d|\)  number of distinct terms in d
\(l_d\)  length of document d (number of term occurrences, note \(l_d \ge |\mathcal {T}_d|\))
\(\mathrm {E}_{\mathcal {D}_t}[\textit{tf}_d] = l_t/|\mathcal {D}_t|\)  average frequency of term t in the documents in which the term occurs
\(\mathrm {E}_{\mathcal {T}_d}[\textit{tf}_d]=l_d/|\mathcal {T}_d|\)  average term frequency of terms that occur in document d
\(\bar{l}_d := \mathrm {E}_{\mathcal {D}}[l_d] = l_c/|\mathcal {D}|\)  average document length
\(\bar{l}_t := \mathrm {E}_{\mathcal {T}}[l_t] = l_c/|\mathcal {T}|\)  average term length
Note that there are two notions regarding “average term frequency”, \(\mathrm {E}_{\mathcal {D}_t}[\textit{tf}_d]\) and \(\mathrm {E}_{\mathcal {T}_d}[\textit{tf}_d]\). In the first case the average is performed fixing t and averaging across the documents \(\mathcal {D}_t\) containing t, and in the second case the average is performed fixing d and averaging across the terms \(\mathcal {T}_d\) contained therein.
\(P(t)=P_L(t)=l_t/l_c\)  location-based probability of \(t\in \mathcal {T}\)
\(P(d)=P_L(d)=l_d/l_c\)  location-based probability of \(d\in \mathcal {D}\)
\(P_D(t)=|\mathcal {D}_t|/|\mathcal {D}|\)  document-based probability of \(t\in \mathcal {T}\)
\(P_T(d)=|\mathcal {T}_d|/|\mathcal {T}|\)  term-based probability of \(d\in \mathcal {D}\)
As can be seen, in this paper, when mentioning probability (P) with no index we refer to the probability based on locations, i.e. the probability defined on the sample space of term occurrences.
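The notation above can be made concrete with a minimal sketch. The toy collection and all variable names below are hypothetical and chosen only for illustration; each quantity follows the definitions listed in this section.

```python
from collections import Counter

# Hypothetical toy collection: each document is a list of term occurrences.
docs = {
    "d1": ["a", "a", "b"],
    "d2": ["b", "c"],
    "d3": ["a", "c", "c", "c"],
}

tf = {d: Counter(tokens) for d, tokens in docs.items()}  # tf_d per document
terms = sorted({t for tokens in docs.values() for t in tokens})

n_docs = len(docs)                                    # |D|
n_terms = len(terms)                                  # |T|
l_d = {d: len(tokens) for d, tokens in docs.items()}  # document lengths
l_c = sum(l_d.values())                               # collection length
l_t = {t: sum(tf[d][t] for d in docs) for t in terms}           # term lengths
df = {t: sum(1 for d in docs if tf[d][t] > 0) for t in terms}   # |D_t|
distinct = {d: len(tf[d]) for d in docs}                        # |T_d|

avg_l_d = l_c / n_docs    # average document length
avg_l_t = l_c / n_terms   # average term length

P_L_t = {t: l_t[t] / l_c for t in terms}          # location-based P(t)
P_L_d = {d: l_d[d] / l_c for d in docs}           # location-based P(d)
P_D_t = {t: df[t] / n_docs for t in terms}        # document-based P_D(t)
P_T_d = {d: distinct[d] / n_terms for d in docs}  # term-based P_T(d)
```

Both location-based distributions sum to 1 because the document lengths and the term lengths each add up to the collection length \(l_c\).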
1.2 Motivations
In this section we formally introduce the document verboseness and term burstiness. We then motivate their investigation in IR models.
A document is verbose if few terms are repeated many times. Document verboseness, \(v_d := l_d/|\mathcal {T}_d|\), takes values in \([1, l_d]\): 1 for a non-verbose document (no term occurs more than once), and \(l_d\) for a maximally verbose one (one term is repeated \(l_d\) times).
Intuitively, the more verbose (repetitive) a document is, the higher the chance of finding a high \(\textit{tf}_d\). In other words, a document may obtain a high score just because words are repeated (e.g. spamming), and therefore one wants to demote verbose documents in the ranking.
A term is bursty if it occurs in few documents many times. Term burstiness, \(b_t := l_t/|\mathcal {D}_t|\), takes values in \([1, l_t]\): 1 for a non-bursty term (it occurs only once in each document where it is present), and \(l_t\) for a maximally bursty one (all its occurrences are in a single document).
Intuitively, the more bursty a term is, the higher the chance of finding a high \(\textit{tf}_d\). In other words, a bursty term occurs in fewer documents than a non-bursty (normal) term, and therefore one wants to promote documents containing bursty terms.
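The two extremes of each domain can be checked with a small sketch. The function names and toy documents are hypothetical; the formulas are the ratios \(l_d/|\mathcal {T}_d|\) and \(l_t/|\mathcal {D}_t|\) used in this paper.

```python
def verboseness(tokens):
    """v_d = l_d / |T_d| for a non-empty document given as a list of term occurrences."""
    return len(tokens) / len(set(tokens))


def burstiness(term, docs):
    """b_t = l_t / |D_t| over a list of tokenized documents.

    Assumes the term occurs in at least one document.
    """
    counts = [doc.count(term) for doc in docs]
    docs_with_term = sum(1 for c in counts if c > 0)
    return sum(counts) / docs_with_term


non_verbose = ["a", "b", "c", "d"]   # no term repeated: v_d = 1
max_verbose = ["a", "a", "a", "a"]   # one term repeated l_d times: v_d = l_d

collection = [["x", "x", "x"], ["y", "z"], ["y", "w"]]
# "x" has all its occurrences in one document: b_t = l_t
# "y" occurs once in each document containing it: b_t = 1
```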
Instead of verboseness and burstiness, scoring functions most often use normalization of the \(\textit{tf}_d\) based on the document length \(l_d\) (e.g. in the TF component of BM25 and in some versions of TFIDF).
It is surprising that IR models are keen to capture the \(\hat{l}_d\), but seem to hide away verboseness and burstiness, i.e. there is no parameter explicitly associated with these properties. However, we observe that some IR models use these normalizations implicitly.
We investigate which IR models capture verboseness and burstiness, and how the parameters can be made explicit or added. Motivated by the work of Lipani et al. (2015), we formally justify verboseness from its duality with the document length normalization. As a supportive case we also present its duality with the concept of burstiness (Roelleke 2013), and term length (aka collection frequency).
1.3 Contributions and structure
The main contributions of this paper are: (1) The inclusion of document verboseness as an explicit parameter in TF quantifications, showing that verboseness is to be viewed in a similar way as the document length in the TF normalizations; (2) An extensive set of experiments capturing a well-defined spectrum of TF quantifications, whose results for log-based and BM25-based TF quantifications deliver a significant contribution to insights into the effect of TF quantifications, even beyond the TF normalization variants; (3) Theoretical justifications for the way document verboseness and length are combined, considering the dualities between verboseness and other parameters (including the burstiness of terms).
The remainder of the paper is structured as follows: in Sect. 2 we present the background. In Sect. 3, the main contribution of the paper, namely combining document verboseness and length into the normalization parameter \(K_d\) of the TF quantification, is presented. We next review in Sect. 4 the probabilistic foundations of IR models. This highlights the role of parameters such as verboseness, burstiness and document length, and the theoretical justification of \(\text {TF}_{\text {BM25}}\)-IDF. In Sect. 5, we report the experimental setup and results, followed by Sect. 6 dedicated to the discussion of the results. Section 7 concludes the paper.
2 Background
The discussion about the TF normalization was initiated by Robertson and Zaragoza (2009), who introduced the two hypotheses of verboseness and multitopicality, and was followed by the work of Singhal et al. (1996), where the document length pivotization is justified experimentally. Not much work has been done on the multitopicality hypothesis, but some has been done on the verboseness hypothesis. However, the problem of how to weight terms dates back further, to the work of Salton and Buckley (1988). Na et al. (2008) introduce the concept of repetitiveness to derive a smoothing method for Language Modeling, showing an improvement with respect to other smoothing methods.
Following other work on the TF normalization issues, He and Ounis (2005a) apply the Dirichlet priors to the TF normalization following the idea of Amati and Van Rijsbergen (2002), and test it on different test collections (He and Ounis 2003, 2005b). Lv and Zhai pointed out that the TF quantification based on document length excessively penalizes very long documents due to its lower bound, a problem mitigated by leveraging the TF normalization by adding a constant (Lv and Zhai 2011b). They also pointed out that in case of BM25 it can be mitigated by adding a constant to the TF normalization (Lv and Zhai 2011c). Rousseau and Vazirgiannis (2013) generalized the previously mentioned TF normalizations through functional composition. Lv and Zhai (2011a) estimate dynamically the parameter \(k_1\) of BM25, based on a proposed information gain measure.
Another way of approaching the length normalization issue is to consider retrieval of the individual passages (Robertson and Walker 1999). However, this use of passages to address length normalization is theoretically unjustified and introduces a series of decision points (size and nature of passages) that are not the focus of the current study.
3 TF normalisations
Considering the remaining elements, \(\bar{a}_t\), \(\breve{a}_t\), \(\bar{a}_d\) and \(\breve{a}_d\), we can think of them as defining an average document \(\bar{d} = [\bar{a}_{t_1}\,\ldots \,\bar{a}_{t_{|\mathcal {T}|}}]\), an elite average document \(\breve{d} = [\breve{a}_{t_1}\,\ldots \,\breve{a}_{t_{|\mathcal {T}|}}]\), an average term \(\bar{t} = [\bar{a}_{d_1}\,\ldots \,\bar{a}_{d_{|\mathcal {D}|}}]\), and an elite average term \(\breve{t} = [\breve{a}_{d_1}\,\ldots \,\breve{a}_{d_{|\mathcal {D}|}}]\). Moreover, we observe also that the elite average document is equal to \(\breve{d} = [b_{t_1}\,\ldots \,b_{t_{|\mathcal {T}|}}]\) and the elite average term is equal to \(\breve{t} = [v_{d_1}\,\ldots \,v_{d_{|\mathcal {D}|}}]\).
So, now, for each row d and for each column t we have a sum, an average, and an elite average. To obtain a collection-level statistic, we have to aggregate again, calculating sums and averages (common and elite averages are identical now, because all rows and all columns have a non-zero aggregated value).
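The row and column aggregation just described can be sketched as follows. The matrix is a hypothetical toy example, and the sketch reads the elite average as the average over non-zero entries only, which is consistent with the elite average document being the vector of burstiness values and the elite average term being the vector of verboseness values.

```python
# Hypothetical term-document matrix: rows are documents, columns are terms,
# entries are within-document term frequencies.
matrix = [
    [2, 1, 0],  # d1
    [0, 1, 1],  # d2
    [1, 0, 3],  # d3
]


def aggregates(values):
    """Return (sum, average over all entries, elite average over non-zero entries).

    Assumes at least one entry is non-zero.
    """
    total = sum(values)
    nonzero = sum(1 for v in values if v > 0)
    return total, total / len(values), total / nonzero


rows = [aggregates(row) for row in matrix]               # one triple per document d
cols = [aggregates(list(col)) for col in zip(*matrix)]   # one triple per term t

# Second aggregation: every row sum is non-zero, so the common and the
# elite average of the row sums coincide.
row_sums = [r[0] for r in rows]
l_c, avg_row_sum, elite_avg_row_sum = aggregates(row_sums)
```

In this toy matrix the elite average of the first row is 1.5, which is the verboseness of d1 (3 occurrences over 2 distinct terms).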
A graphical representation of the calculations performed in this section is shown in Fig. 1.
3.1 Duality: document verboseness and length
The dualities between average document verboseness and average document length justify the combination of parameters as formalized in the definition capturing the normalization variants of \(K_d\):
Definition 1

\(\ddot{K}_d\): the non-elite normalization comprises the non-elite pivots \(\ddot{l}_d\) and \(\ddot{v}_d\).

\(\hat{K}_d\): the elite normalization comprises the elite pivots \(\hat{l}_d\) and \(\hat{v}_d\).

The expression \({\text {pivdl}}\), pivoted document length, denotes one of the two:
To summarize, there are four variants of the pivotization factor \(K_d\): non-elite disjunctive denoted as \(\ddot{K}_{\vee }\), non-elite conjunctive denoted as \(\ddot{K}_{\wedge }\), and the respective elite variants \(\hat{K}_{\vee }\) and \(\hat{K}_{\wedge }\). The experiments emphasize the analysis of the behavior of these four variants.
3.2 Example of calculation of the pivotizations
The next example illustrates the arithmetic to compute the pivoted document verboseness and length.
Example 1
The elite average verboseness is the average over the verboseness values of the documents. For example, let \(\breve{v}_d=5/2\) be the elite verboseness.
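A minimal sketch of this arithmetic, on a hypothetical two-document collection chosen so that the elite average verboseness comes out at 5/2. The pivoted value is computed here as the ratio of each document's verboseness to that average; this is an assumption made by analogy with the pivoted document length and is not taken from the definition above.

```python
from fractions import Fraction

# Hypothetical collection: (l_d, |T_d|) for each document.
docs = {"d1": (6, 2), "d2": (4, 2)}

# Verboseness of each document: v_d = l_d / |T_d|.
v = {d: Fraction(length, distinct) for d, (length, distinct) in docs.items()}

# Elite average verboseness: the average of v_d over the documents.
elite_avg_v = sum(v.values()) / len(v)

# Assumed pivoted verboseness: v_d divided by the elite average.
pivdv = {d: v[d] / elite_avg_v for d in docs}
```

With these numbers d1 is more verbose than average (pivoted value 6/5) and d2 less so (4/5).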
3.3 Other dualities
Overall, the discussion supports the case to consider verboseness as a document-specific parameter, whereas traditional IR focuses on the pivoted document length only.
3.4 Summary
List of all four dual properties
Document verboseness  \(v_d := l_d/|\mathcal {T}_d|\)
Document length  \(l_d := l_d/|\mathcal {D}_d|\) (noting that \(|\mathcal {D}_d|=1\))
Term burstiness  \(b_t := l_t/|\mathcal {D}_t|\)
Term length  \(l_t := l_t/|\mathcal {T}_t|\) (noting that \(|\mathcal {T}_t|=1\))
4 Probabilistic derivation of IR models
4.1 Observations about the \({\text {TF}}\) component
The following definition formalizes the well-defined spectrum of \({\text {TF}}\) quantifications (Roelleke et al. 2015).
Definition 2
With this understanding of what the TF stands for, namely a factor modeling a dependence assumption, the role of \(K_d\) is to tune the dependence assumption. For \(K_d>1\), that is for long documents, \({\text {TF}}(t,d)\) decreases, i.e. the dependence increases. This means that in long documents, the multiple term occurrences are more dependent than in short documents. This makes perfect sense when imagining a long document that repeats some terms many times.
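The effect of \(K_d\) on the quantification can be illustrated with a small sketch. It uses the widely known BM25 forms, \(K_d = k_1 (b \cdot {\text {pivdl}} + (1-b))\) and \(\textit{tf}_d/(\textit{tf}_d + K_d)\), as an assumption; it does not reproduce Definition 2 or the extended \(K_d\) of this paper, and the function names are hypothetical.

```python
def k_d(pivdl, k1=1.2, b=0.7):
    """Classic BM25 normalization factor: k1 * (b * pivdl + (1 - b))."""
    return k1 * (b * pivdl + (1.0 - b))


def tf_bm25(tf, K):
    """BM25-style saturating quantification: tf / (tf + K)."""
    return tf / (tf + K)


# Same within-document frequency, documents of different pivoted length.
tf = 3
short_doc = tf_bm25(tf, k_d(0.5, k1=1.0))  # pivdl < 1, so K_d < 1
avg_doc = tf_bm25(tf, k_d(1.0, k1=1.0))    # pivdl = 1, so K_d = 1
long_doc = tf_bm25(tf, k_d(2.0, k1=1.0))   # pivdl > 1, so K_d > 1
```

For a fixed term frequency the quantification shrinks as \(K_d\) grows, which is the behavior described above for long documents.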
The following equation indicates the difference between the standard \(K_d\) as known for BM25 [as shown in Eq. (26)], and the systematic extension proposed and investigated in this paper, which combines the pivoted document length (\({\text {pivdl}}\)) and the pivoted document verboseness (\(\text {pivdv}\)).
4.2 Observations about the \({\text {IDF}}\) component
4.3 LM and TFIDF
In summary, in this section we have explored the relationship between TFIDF and LM. Both models apply a mixture: TFIDF for estimating \(P(t|q, c)\), and LM for estimating \(P(t|d, c)\). Moreover, both models involve the component \(b_t/\bar{l}_d \cdot P_D(t)\) measuring the discriminativeness of the term, where burstiness is made explicit.
The mixture assumption for \(P(t|q, c)\) leads to IDF and it becomes clear why IDF is seen as capturing burstiness in an “implicit” way (Church and Gale 1999). The Dirichlet-based mixture for \(P(t|d, c)\), usually only associated with the document length, is extended with the document verboseness. This extension is done analogously to the way the TF quantification has been extended for the TFIDF models.
5 Experiments
Table 2 Test collection information: collection size \(|\mathcal {D}|\), number of terms \(|\mathcal {T}|\), collection length \(l_c\), average document length \(\bar{l}_d\), average verboseness \(\bar{v}_d\), elite average verboseness \(\breve{v}_d\), average term length \(\bar{l}_t\), average burstiness \(\bar{b}_t\), and elite average burstiness \(\breve{b}_t\)
Corpus  EC  Challenge  \(|\mathcal {D}|\)  \(|\mathcal {T}|\)  \(l_c\)  \(\bar{l}_d\)  \(\bar{v}_d\)  \(\breve{v}_d\downarrow\)  \(\bar{l}_t\)  \(\bar{b}_t\)  \(\breve{b}_t\)
Aquaint  TREC  HARD’05  1,033,461  647,280  282,858,247  273.700  436.995  1.519  436.995  273.700  1.384
Disks 4&5  TREC  Ad Hoc 8  528,106  737,963  156,226,039  295.823  211.699  1.575  211.699  295.823  1.377
eHealth’14  CLEF  eHealth’14  1,104,298  1,103,947  685,458,908  620.917  308.294  1.900  308.294  620.917  1.349
.GOV  TREC  Web’02  1,214,592  2,937,251  1,770,120,644  1,457.379  602.645  4.830  602.645  1,457.379  3.012
5.1 Setup and materials
To test the \({\text {TF}}\) normalization variants on the different kinds of \({\text {TF}}\) quantifications, we used 4 test collections: TREC HARD 2005, TREC Ad Hoc 8, CLEF eHealth 2014, and TREC Web 2002. Details and corpora properties are shown in Table 2. The test collections have been purposefully chosen with a high degree of variability of \(\breve{v}_d\). In this way we can observe the different use of the language in different domains (e.g. we observe that in .GOV on average a term is repeated 218% more times than in the Aquaint collection). We developed^{2} the tested IR models on the IR platform Terrier^{3} 4.2. All the documents have been preprocessed using the English tokenizer and Porter stemmer of the Terrier search engine. Whenever a test collection provides topics of multiple lengths, we use the shortest kind.
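The 218% figure follows from the elite average verboseness values reported for the two collections (4.830 for .GOV and 1.519 for Aquaint). A short check, with hypothetical variable names:

```python
# Elite average verboseness as reported for the two collections.
v_gov = 4.830
v_aquaint = 1.519

# Relative increase of .GOV over Aquaint, in percent.
relative_increase = (v_gov / v_aquaint - 1.0) * 100.0
print(f"{relative_increase:.0f}%")
```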

16 models based on TFIDF variants: 4 \({\text {TF}}\) normalizations for each of the 4 \({\text {TF}}\) quantifications defined in Definition 2. Each model is identified by its \({\text {TF}}\) quantification, \(\text {TF}_{\text {total}}\), \(\text {TF}_{\text {log}}\), \(\text {TF}_{\text {BM25}}\), and \(\text {TF}_{\text {constant}}\) and kind of \({\text {TF}}\) normalization applied: non-elite disjunctive \(\ddot{K}_{\vee ,d}\), non-elite conjunctive \(\ddot{K}_{\wedge ,d}\), elite disjunctive \(\hat{K}_{\vee ,d}\) and elite conjunctive \(\hat{K}_{\wedge ,d}\).

4 models based on DLM: Each Dirichlet-based mixture is identified by its kind of \(\lambda _{d}\) normalization applied: non-elite disjunctive \(\ddot{\lambda }_{\vee ,d}\), non-elite conjunctive \(\ddot{\lambda }_{\wedge ,d}\), elite disjunctive \(\hat{\lambda }_{\vee ,d}\) and elite conjunctive \(\hat{\lambda }_{\wedge ,d}\).

4 models based on the TF\(\text {IDF}_\text {L}\): Each Dirichlet-based mixture is identified by its kind of \(\lambda _{q}\) normalization applied: non-elite disjunctive \(\ddot{\lambda }_{\vee ,q}\), non-elite conjunctive \(\ddot{\lambda }_{\wedge ,q}\), elite disjunctive \(\hat{\lambda }_{\vee ,q}\) and elite conjunctive \(\hat{\lambda }_{\wedge ,q}\). As \({\text {TF}}\) component, we select the non-normalized \(\text {TF}_{\text {total}}\).
The TF normalization of each model presents 3 parameters: \(k_1\), b and the new a introduced in this paper. The DLM and TF\(\text {IDF}_\text {L}\) based models present 2 parameters: b and a. Our experiments focus on the parameter a. For \(k_1\) and b, there are two ways of selecting their values: using the standard values from the literature, or identifying trained values. For the models based on the TFIDF variants, the standard parameters for \(\text {TF}_{\text {BM25}}\) are \(k_1=1.2\) and \(b=0.7\) (Robertson et al. 1994). The standard parameter for \(\text {TF}_{\text {total}}\) and \(\text {TF}_{\text {constant}}\) is \(b=0\), which simplifies \(K_d\) to a constant. In this case we set \(k_1=1\), because it is easy to demonstrate that changing \(k_1\), as long as \(k_1>0\), does not change the ranking of the retrieved documents for these two quantifications. The same parameter values are used for the standard \(\text {TF}_{\text {log}}\) (\(b=0\), \(k_1=1\)). For the models based on the DLM, the standard parameters are \(k_1=1\) and \(b=0\), which reduces to the standard definition of DLM (Zhai and Lafferty 2001). For the models based on the LM variant derived by TFIDF, the standard parameter is \(k_1=+\infty\), which reduces to the standard TFIDF model with the non-normalized \(\text {TF}_{\text {total}}\) quantification.
To identify trained values, the parameters of each model have been spanned as follows: \(a,b \in [0, 1]\) at steps of 0.1, and \(k_1 \in [0,5]\), from 0 to 1 at steps decided by the function 1/n with \(n \in \{1,...,50\}\), and from 1 to 5 at steps of 0.1. The trained values are obtained by maximizing the mean over the topics of the selected evaluation measure. For every model configuration that requires training we perform a five-fold cross-validation.
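The parameter sweep can be sketched as below. This is an illustration under stated assumptions: the values 1/n are taken as the \(k_1\) candidates in (0, 1], \(k_1 = 0\) itself is left out because its inclusion is not clear from the description, topics are assigned to folds round-robin, and `score_of` is a hypothetical callback returning the evaluation score of a setting on one topic.

```python
from itertools import product


def k1_grid():
    """Candidate k1 values: 1/n for n = 1..50, then 1.1 to 5.0 in steps of 0.1."""
    below_one = sorted(1.0 / n for n in range(1, 51))
    above_one = [round(1.0 + 0.1 * i, 1) for i in range(1, 41)]
    return below_one + above_one


# Candidate values for a and b: 0.0, 0.1, ..., 1.0.
unit_grid = [round(0.1 * i, 1) for i in range(11)]


def five_folds(topic_ids):
    """Split topics into 5 folds round-robin; return a list of (train, test) pairs."""
    folds = [topic_ids[i::5] for i in range(5)]
    return [
        ([t for j, fold in enumerate(folds) if j != i for t in fold], folds[i])
        for i in range(5)
    ]


def best_setting(settings, score_of, topics):
    """Pick the setting with the highest mean per-topic score on the given topics."""
    return max(
        settings,
        key=lambda s: sum(score_of(s, t) for t in topics) / len(topics),
    )


# All (a, b, k1) combinations for a model with three parameters.
settings = list(product(unit_grid, unit_grid, k1_grid()))
```

In each fold the setting would be selected with `best_setting` on the training topics and then scored on the held-out topics.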
The IR evaluation measures employed are \(\text {AP}\), \(\text {NDCG}\) and \(\text {P@10}\).
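For reference, two of these measures can be sketched with their textbook definitions. The function names and the toy ranking are hypothetical, binary relevance is assumed, and NDCG is left out because its gain and discount conventions vary between implementations.

```python
def precision_at_k(ranking, relevant, k=10):
    """Fraction of the top-k ranked documents that are relevant."""
    return sum(1 for d in ranking[:k] if d in relevant) / k


def average_precision(ranking, relevant):
    """Sum of precision at each relevant retrieved rank, divided by the
    total number of relevant documents."""
    hits = 0
    total = 0.0
    for rank, d in enumerate(ranking, start=1):
        if d in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


# Toy run: relevant documents are retrieved at ranks 2 and 4.
ranking = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
ap = average_precision(ranking, relevant)
p_at_2 = precision_at_k(ranking, relevant, k=2)
```

The \(\text {AP}\) values reported in the tables are means of such per-topic scores over the topics of each test collection.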
5.2 Model candidates/structure
TFIDF model candidates vary along four dimensions:
1. Pivotization: elite pivotization or non-elite pivotization for document verboseness and length;
2. Normalization: conjunctive (\(\wedge\)) or disjunctive (\(\vee\)) combination of pivoted document verboseness and length into \(K_d\);
3. Quantification: \(\text {TF}_{\text {total}}\), \(\text {TF}_{\text {log}}\), \(\text {TF}_{\text {BM25}}\), or \(\text {TF}_{\text {constant}}\);
4. Parameter Settings: standard (S) or trained (T) parameters.
DLM model candidates vary along three dimensions:
1. Pivotization: elite pivotization or non-elite pivotization for document verboseness and length;
2. Normalization: conjunctive (\(\wedge\)) or disjunctive (\(\vee\)) combination of pivoted document verboseness and length into \(\lambda _d\);
3. Parameter Settings: standard (S) or trained (T) parameters.
TF\(\text {IDF}_\text {L}\) model candidates vary along three dimensions:
1. Pivotization: elite pivotization or non-elite pivotization for term length and burstiness;
2. Normalization: conjunctive (\(\wedge\)) or disjunctive (\(\vee\)) combination of pivoted term length and burstiness into \(\lambda _q\);
3. Parameter Settings: standard (S) or trained (T) parameters.
5.3 Results
1. Document Verboseness versus Length: the two show a certain independence, as indicated by the shape of the distributions in Fig. 3;
2. Pivotization: for TFIDF models the elite pivotization is overall better than the non-elite one; for DLM models the non-elite pivotization performs better;
3. Normalization: for TFIDF models the combination of document verboseness and length achieves significantly better results, especially when combined in a conjunctive fashion; for DLM models the combination of document verboseness and length rarely achieves statistical significance;
4. TF Quantification: \(\text {TF}_{\text {BM25}}\) appears best, with \(\text {TF}_{\text {log}}\) close behind;
5. Standard versus Trained parameters: in both parameter configurations, standard and trained, the use of verboseness makes the model achieve better results. On the other hand, the use of term length has a negligible impact most of the time.
Comparison of the scores obtained with the TFIDF model candidates with each \({\text {TF}}\) normalization using the non-elite and elite pivotization for the HARD 2005 test collection
P  Q  K  C  k1  b  a  \(\text {AP}\)  \(\text {NDCG}\)  \(\text {P@10}\)

Non-elite  \(\text {TF}_{\text {total}}\)  S  –  \(>0\)  0.0  –  0.0721  0.2936  0.1920
T  –  \(>0\)  0.5  –  0.0900 \(\dagger\)  0.3201 \(\dagger\)  0.2160  
\(\vee\)  \(>0\)  0.9  0.9  0.0904 \(\dagger\)  0.3223 \(\dagger \, \ddagger\)  0.2200  
\(\wedge\)  \(>0\)  1.0  0.6  0.0942 \(\dagger \, \ddagger\)  0.3277 \(\dagger \, \ddagger\)  0.2380 \(\ddagger\)  
\(\text {TF}_{\text {log}}\)  S  –  1.0  0.0  –  0.1614  0.4424  0.4160  
T  –  0.2  0.3  –  0.2005 \(\dagger\)  0.4799 \(\dagger\)  0.4360  
\(\vee\)  0.2  0.4  0.2  0.2010 \(\dagger\)  0.4801 \(\dagger\)  0.4320  
\(\wedge\)  5.0  0.8  0.7  0.2003 \(\dagger\)  0.4813 \(\dagger\)  0.4400  
\(\text {TF}_{\text {BM25}}\)  S  –  1.2  0.7  –  0.1848  0.4563  0.3660  
T  \(\vee\)  1.2  0.7  0.6  0.1898  0.4584  0.4280 \(\dagger\)  
–  1.5  0.3  –  0.2023 \(\dagger\)  0.4797 \(\dagger\)  0.4440 \(\dagger\)  
\(\vee\)  1.9  0.4  0.5  0.2030 \(\dagger\)  0.4802 \(\dagger\)  0.4480 \(\dagger\)  
\(\wedge\)  3.2  0.4  0.3  0.2032 \(\dagger\)  0.4812 \(\dagger\)  0.4540 \(\dagger\)  
\(\text {TF}_{\text {constant}}\)  S  –  \(>0\)  0.0  –  0.0613  0.2436  0.1500  
T  –  \(>0\)  0.1  –  0.0735 \(\dagger\)  0.2744 \(\dagger\)  0.1620  
\(\vee\)  \(>0\)  0.2  0.7  0.0742 \(\dagger\)  0.2756 \(\dagger\)  0.1620  
\(\wedge\)  \(>0\)  0.1  0.0  0.0740 \(\dagger\)  0.2745 \(\dagger\)  0.1660  
Elite  \(\text {TF}_{\text {total}}\)  S  –  \(>0\)  0.0  –  0.0721  0.2936  0.1920 
T  –  \(>0\)  0.5  –  0.0900 \(\dagger\)  0.3201 \(\dagger\)  0.2160  
\(\vee\)  \(>0\)  1.0  0.6  0.0946 \(\dagger \, \ddagger\)  0.3283 \(\dagger \, \ddagger\)  0.2380 \(\ddagger\)  
\(\wedge\)  \(>0\)  1.0  0.6  0.0942 \(\dagger \, \ddagger\)  0.3277 \(\dagger \, \ddagger\)  0.2380 \(\ddagger\)  
\(\text {TF}_{\text {log}}\)  S  –  1.0  0.0  –  0.1614  0.4424  0.4160  
T  –  0.2  0.3  –  0.2005 \(\dagger\)  0.4799 \(\dagger\)  0.4360  
\(\vee\)  0.2  0.6  0.5  0.2013 \(\dagger\)  0.4798 \(\dagger\)  0.4300  
\(\wedge\)  0.2  0.8  0.7  0.2003 \(\dagger\)  0.4810 \(\dagger\)  0.4400  
\(\text {TF}_{\text {BM25}}\)  S  –  1.2  0.7  –  0.1848  0.4563  0.3660  
T  \(\vee\)  1.2  0.7  0.6  0.2012 \(\dagger\)  0.4759 \(\dagger\)  0.4480 \(\dagger\)  
–  1.5  0.3  –  0.2023 \(\dagger\)  0.4797 \(\dagger\)  0.4440 \(\dagger\)  
\(\vee\)  1.5  0.5  0.5  0.2034 \(\dagger\)  0.4807 \(\dagger\)  0.4420 \(\dagger\)  
\(\wedge\)  1.9  0.8  0.7  0.2037 \(\dagger\)  0.4833 \(\dagger\)  0.4400 \(\dagger\)  
\(\text {TF}_{\text {constant}}\)  S  –  \(>0\)  0.0  –  0.0613  0.2436  0.1500  
T  –  \(>0\)  0.1  –  0.0735 \(\dagger\)  0.2744 \(\dagger\)  0.1620  
\(\vee\)  \(>0\)  0.1  0.0  0.0735 \(\dagger\)  0.2744 \(\dagger\)  0.1620  
\(\wedge\)  \(>0\)  0.1  0.0  0.0740 \(\dagger\)  0.2745 \(\dagger\)  0.1660 
Comparison of the scores obtained with the TFIDF model candidates with each \({\text {TF}}\) normalization using the non-elite and elite pivotization for the Ad Hoc 8 test collection
P  Q  K  C  k1  b  a  \(\text {AP}\)  \(\text {NDCG}\)  \(\text {P@10}\)

Non-elite  \(\text {TF}_{\text {total}}\)  S  –  \(>0\)  0.0  –  0.0635  0.2762  0.1360
T  –  \(>0\)  0.5  –  0.0977 \(\dagger\)  0.3306 \(\dagger\)  0.2240 \(\dagger\)  
\(\vee\)  \(>0\)  0.5  0.0  0.0977 \(\dagger\)  0.3306 \(\dagger\)  0.2240 \(\dagger\)  
\(\wedge\)  \(>0\)  1.0  0.5  0.1076 \(\dagger \, \ddagger\)  0.3491 \(\dagger \, \ddagger\)  0.2400 \(\dagger\)  
\(\text {TF}_{\text {log}}\)  S  –  1.0  0.0  –  0.1753  0.4568  0.3360  
T  –  0.1  0.3  –  0.2478 \(\dagger\)  0.5381 \(\dagger\)  0.4280 \(\dagger\)  
\(\vee\)  0.1  0.9  0.9  0.2563  0.5415  0.4560  
\(\wedge\)  0.1  0.9  0.5  0.2625 \(\dagger \, \ddagger\)  0.5475 \(\dagger\)  0.4620 \(\dagger \, \ddagger\)  
\(\text {TF}_{\text {BM25}}\)  S  –  1.2  0.7  –  0.2433  0.5193  0.4680  
T  \(\vee\)  1.2  0.7  0.8  0.2614 \(\dagger\)  0.5438 \(\dagger\)  0.4480  
–  0.6  0.3  –  0.2614 \(\dagger\)  0.5447 \(\dagger\)  0.4520  
\(\vee\)  0.6  0.3  0.1  0.2616 \(\dagger\)  0.5441 \(\dagger\)  0.4620 \(\ddagger\)  
\(\wedge\)  2.7  0.6  0.5  0.2681 \(\dagger \, \ddagger\)  0.5523 \(\dagger \, \ddagger\)  0.4660  
\(\text {TF}_{\text {constant}}\)  S  –  \(>0\)  0.0  –  0.1550  0.4071  0.2060  
T  –  \(>0\)  0.1  –  0.1868 \(\dagger\)  0.4387 \(\dagger\)  0.3260 \(\dagger\)  
\(\vee\)  \(>0\)  0.1  0.9  0.1880 \(\dagger\)  0.4452 \(\dagger \, \ddagger\)  0.3240 \(\dagger\)  
\(\wedge\)  \(>0\)  0.2  0.4  0.1922 \(\dagger\)  0.4462 \(\dagger \, \ddagger\)  0.3260 \(\dagger\)  
Elite  \(\text {TF}_{\text {total}}\)  S  –  \(>0\)  0.0  –  0.0635  0.2762  0.1360 
T  –  \(>0\)  0.5  –  0.0977 \(\dagger\)  0.3306 \(\dagger\)  0.2240 \(\dagger\)  
\(\vee\)  \(>0\)  1.0  0.7  0.1056 \(\dagger \, \ddagger\)  0.3469 \(\dagger \, \ddagger\)  0.2380 \(\dagger\)  
\(\wedge\)  \(>0\)  1.0  0.5  0.1076 \(\dagger \, \ddagger\)  0.3491 \(\dagger \, \ddagger\)  0.2400 \(\dagger\)  
\(\text {TF}_{\text {log}}\)  S  –  1.0  0.0  –  0.1753  0.4568  0.3360  
T  –  0.1  0.3  –  0.2478 \(\dagger\)  0.5381 \(\dagger\)  0.4280 \(\dagger\)  
\(\vee\)  0.1  1.0  0.7  0.2521 \(\dagger\)  0.5435 \(\dagger\)  0.4500 \(\dagger \, \ddagger\)  
\(\wedge\)  0.1  0.8  0.6  0.2562 \(\dagger \, \ddagger\)  0.5474 \(\dagger \, \ddagger\)  0.4540 \(\dagger \, \ddagger\)  
\(\text {TF}_{\text {BM25}}\)  S  –  1.2  0.7  –  0.2433  0.5193  0.4680  
T  \(\vee\)  1.2  0.7  0.6  0.2535 \(\dagger\)  0.5399 \(\dagger\)  0.4700  
–  0.6  0.3  –  0.2614 \(\dagger\)  0.5447 \(\dagger\)  0.4520  
\(\vee\)  0.5  1.0  0.7  0.2638 \(\dagger\)  0.5463 \(\dagger\)  0.4700  
\(\wedge\)  0.6  0.6  0.5  0.2681 \(\dagger \, \ddagger\)  0.5524 \(\dagger \, \ddagger\)  0.4680 \(\ddagger\)  
\(\text {TF}_{\text {constant}}\)  S  –  \(>0\)  0.0  –  0.1550  0.4071  0.2060  
T  –  \(>0\)  0.1  –  0.1868 \(\dagger\)  0.4387 \(\dagger\)  0.3260 \(\dagger\)  
\(\vee\)  \(>0\)  0.1  0.4  0.1878 \(\dagger\)  0.4418 \(\dagger \, \ddagger\)  0.3320 \(\dagger\)  
\(\wedge\)  \(>0\)  0.2  0.4  0.1922 \(\dagger\)  0.4462 \(\dagger \, \ddagger\)  0.3260 \(\dagger\) 
Comparison of the scores obtained with the TFIDF model candidates with each \({\text {TF}}\) normalization using the non-elite and elite pivotization for the eHealth 2014 test collection
P  Q  K  C  k1  b  a  \(\text {AP}\)  \(\text {NDCG}\)  \(\text {P@10}\)

Non-elite  \(\text {TF}_{\text {total}}\)  S  –  \(>0\)  0.0  –  0.1166  0.3361  0.2640
T  –  \(>0\)  0.7  –  0.2594 \(\dagger\)  0.5206 \(\dagger\)  0.5580 \(\dagger\)  
\(\vee\)  \(>0\)  0.8  0.4  0.2610 \(\dagger\)  0.5209 \(\dagger\)  0.5540 \(\dagger\)  
\(\wedge\)  \(>0\)  1.0  0.4  0.2699 \(\dagger\)  0.5322 \(\dagger\)  0.5580 \(\dagger\)  
\(\text {TF}_{\text {log}}\)  S  –  1.0  0.0  –  0.2106  0.4637  0.4280  
T  –  0.2  0.7  –  0.4222  0.6701 \(\dagger\)  0.7960 \(\dagger\)  
\(\vee\)  0.4  0.8  0.5  0.4242  0.6729 \(\dagger \, \ddagger\)  0.8000 \(\dagger\)  
\(\wedge\)  1.9  1.0  0.4  0.4260  0.6729 \(\dagger\)  0.8040 \(\dagger\)  
\(\text {TF}_{\text {BM25}}\)  S  –  1.2  0.7  –  0.3729  0.6310  0.7640  
T  \(\vee\)  1.2  0.7  0.0  0.3729  0.6310  0.7640  
–  4.5  0.6  –  0.4022 \(\dagger\)  0.6595 \(\dagger\)  0.7840  
\(\vee\)  4.5  0.6  0.0  0.4022 \(\dagger\)  0.6595 \(\dagger\)  0.7840  
\(\wedge\)  4.5  0.7  0.0  0.4018 \(\dagger\)  0.6542 \(\dagger\)  0.7880  
\(\text {TF}_{\text {constant}}\)  S  –  \(>0\)  0.0  –  0.0474  0.2021  0.1140  
T  –  \(>0\)  0.2  –  0.0755 \(\dagger\)  0.2552 \(\dagger\)  0.2280 \(\dagger\)  
\(\vee\)  \(>0\)  0.0  0.0  0.0840 \(\dagger\)  0.3523 \(\dagger \, \ddagger\)  0.1760 \(\dagger\)  
\(\wedge\)  \(>0\)  0.2  0.2  0.0745 \(\dagger\)  0.2551 \(\dagger\)  0.2260 \(\dagger\)  
Elite  \(\text {TF}_{\text {total}}\)  S  –  \(>0\)  0.0  –  0.1166  0.3361  0.2640 
T  –  \(>0\)  0.7  –  0.2594 \(\dagger\)  0.5206 \(\dagger\)  0.5580 \(\dagger\)  
\(\vee\)  \(>0\)  1.0  0.5  0.2697 \(\dagger\)  0.5316 \(\dagger \, \ddagger\)  0.5820 \(\dagger\)  
\(\wedge\)  \(>0\)  1.0  0.4  0.2699 \(\dagger\)  0.5322 \(\dagger\)  0.5580 \(\dagger\)  
\(\text {TF}_{\text {log}}\)  S  –  1.0  0.0  –  0.2106  0.4637  0.4280  
T  –  0.2  0.7  –  0.4222  0.6701 \(\dagger\)  0.7960 \(\dagger\)  
\(\vee\)  0.2  1.0  0.4  0.4239  0.6713 \(\dagger\)  0.8080 \(\dagger\)  
\(\wedge\)  0.2  1.0  0.4  0.4239  0.6715 \(\dagger\)  0.8060 \(\dagger\)  
\(\text {TF}_{\text {BM25}}\)  S  –  1.2  0.7  –  0.3729  0.6310  0.7640  
T  \(\vee\)  1.2  0.7  0.1  0.3742  0.6320  0.7640  
–  4.5  0.6  –  0.4022 \(\dagger\)  0.6595 \(\dagger\)  0.7840  
\(\vee\)  5.0  1.0  0.5  0.4079 \(\dagger \, \ddagger\)  0.6635 \(\dagger \, \ddagger\)  0.7900  
\(\wedge\)  5.0  1.0  0.4  0.4092 \(\dagger \, \ddagger\)  0.6607 \(\dagger\)  0.8000  
\(\text {TF}_{\text {constant}}\)  S  –  \(>0\)  0.0  –  0.0474  0.2021  0.1140  
T  –  \(>0\)  0.2  –  0.0755 \(\dagger\)  0.2552 \(\dagger\)  0.2280 \(\dagger\)  
\(\vee\)  \(>0\)  0.2  0.0  0.0755 \(\dagger\)  0.2552 \(\dagger\)  0.2280 \(\dagger\)  
\(\wedge\)  \(>0\)  0.2  0.2  0.0745 \(\dagger\)  0.2551 \(\dagger\)  0.2260 \(\dagger\) 
Table 6 Comparison of the scores obtained with the TF-IDF model candidates with each \({\text {TF}}\) normalization using the nonelite and elite pivotization for the Web 2002 test collection
P  Q  K  C  k1  b  a  \(\text {AP}\)  \(\text {NDCG}\)  \(\text {P@10}\) 

Nonelite  \(\text {TF}_{\text {total}}\)  S  –  \(>0\)  0.0  –  0.0171  0.1387  0.0260 
T  –  \(>0\)  0.9  –  0.0568 \(\dagger\)  0.2642 \(\dagger\)  0.0880 \(\dagger\)  
\(\vee\)  \(>0\)  0.9  0.4  0.0577 \(\dagger\)  0.2713 \(\dagger \, \ddagger\)  0.0820 \(\dagger\)  
\(\wedge\)  \(>0\)  1.0  0.4  0.0563 \(\dagger\)  0.2732 \(\dagger\)  0.0800 \(\dagger\)  
\(\text {TF}_{\text {log}}\)  S  –  1.0  0.0  –  0.0603  0.2719  0.1100  
T  –  0.2  0.8  –  0.1951 \(\dagger\)  0.4799 \(\dagger\)  0.2420 \(\dagger\)  
\(\vee\)  0.2  0.9  0.6  0.1991 \(\dagger\)  0.4803 \(\dagger\)  0.2360 \(\dagger\)  
\(\wedge\)  0.2  0.9  0.2  0.1974 \(\dagger\)  0.4812 \(\dagger\)  0.2360 \(\dagger\)  
\(\text {TF}_{\text {BM25}}\)  S  –  1.2  0.7  –  0.1948  0.4696  0.2380  
T  \(\vee\)  1.2  0.7  0.0  0.1948  0.4696  0.2380  
–  4.1  0.7  –  0.2010  0.4777  0.2520  
\(\vee\)  3.1  0.7  0.1  0.2016  0.4816  0.2420  
\(\wedge\)  5.0  0.8  0.2  0.1923  0.4722  0.2520  
\(\text {TF}_{\text {constant}}\)  S  –  \(>0\)  0.0  –  0.0140  0.1514  0.0140  
T  –  \(>0\)  0.1  –  0.0310 \(\dagger\)  0.2041 \(\dagger\)  0.0500 \(\dagger\)  
\(\vee\)  \(>0\)  0.2  0.3  0.0310 \(\dagger\)  0.2008 \(\dagger\)  0.0500 \(\dagger\)  
\(\wedge\)  \(>0\)  0.1  0.5  0.0311 \(\dagger\)  0.1979 \(\dagger\)  0.0480 \(\dagger\)  
Elite  \(\text {TF}_{\text {total}}\)  S  –  \(>0\)  0.0  –  0.0171  0.1387  0.0260 
T  –  \(>0\)  0.9  –  0.0568 \(\dagger\)  0.2642 \(\dagger\)  0.0880 \(\dagger\)  
\(\vee\)  \(>0\)  1.0  0.4  0.0635 \(\dagger\)  0.2860 \(\dagger \, \ddagger\)  0.0940 \(\dagger\)  
\(\wedge\)  \(>0\)  1.0  0.4  0.0563 \(\dagger\)  0.2732 \(\dagger\)  0.0800 \(\dagger\)  
\(\text {TF}_{\text {log}}\)  S  –  1.0  0.0  –  0.0603  0.2719  0.1100  
T  –  0.2  0.8  –  0.1951 \(\dagger\)  0.4799 \(\dagger\)  0.2420 \(\dagger\)  
\(\vee\)  0.1  0.9  0.2  0.1989  0.4817  0.2360  
\(\wedge\)  0.1  0.9  0.2  0.1975 \(\dagger\)  0.4816 \(\dagger\)  0.2380 \(\dagger\)  
\(\text {TF}_{\text {BM25}}\)  S  –  1.2  0.7  –  0.1948  0.4696  0.2380  
T  \(\vee\)  1.2  0.7  0.0  0.1948  0.4696  0.2380  
–  4.1  0.7  –  0.2010  0.4777  0.2520  
\(\vee\)  3.6  0.8  0.2  0.2016  0.4808  0.2460  
\(\wedge\)  3.3  1.0  0.4  0.1966  0.4770  0.2500  
\(\text {TF}_{\text {constant}}\)  S  –  \(>0\)  0.0  –  0.0140  0.1514  0.0140  
T  –  \(>0\)  0.1  –  0.0310 \(\dagger\)  0.2041 \(\dagger\)  0.0500 \(\dagger\)  
\(\vee\)  \(>0\)  0.2  0.3  0.0319 \(\dagger\)  0.1988 \(\dagger\)  0.0520 \(\dagger\)  
\(\wedge\)  \(>0\)  0.1  0.5  0.0311 \(\dagger\)  0.1979 \(\dagger\)  0.0480 \(\dagger\) 
Table 7 Comparison of the scores obtained with the DLM model candidates using the nonelite and elite pivotization
Challenge  P  K  C  b  a  \(\text {AP}\)  \(\text {NDCG}\)  \(\text {P@10}\) 

HARD’05  S  –  1.0  –  0.1912  0.4680  0.4220  
Nonelite  T  \(\vee\)  1.0  0.8  0.1970  0.4801 \(\dagger\)  0.4580 \(\dagger\)  
\(\wedge\)  1.0  0.3  0.1998 \(\dagger\)  0.4806 \(\dagger\)  0.4380  
Elite  T  \(\vee\)  1.0  0.0  0.1912  0.4680  0.4220  
\(\wedge\)  1.0  0.0  0.1912  0.4680  0.4220  
Ad Hoc 8  S  –  1.0  –  0.2583  0.5420  0.4560  
Nonelite  T  \(\vee\)  0.9  0.7  0.2625 \(\dagger\)  0.5481 \(\dagger\)  0.4600  
\(\wedge\)  0.8  0.3  0.2606  0.5448  0.4480  
Elite  T  \(\vee\)  0.9  0.0  0.2589  0.5410  0.4680  
\(\wedge\)  0.9  0.0  0.2587  0.5415  0.4600  
eHealth’14  S  –  1.0  –  0.3863  0.6444  0.7980  
Nonelite  T  \(\vee\)  0.8  0.5  0.3965 \(\dagger\)  0.6468  0.7900  
\(\wedge\)  0.7  0.7  0.4082 \(\dagger\)  0.6616 \(\dagger\)  0.7920  
Elite  T  \(\vee\)  0.8  0.0  0.3939 \(\dagger\)  0.6467  0.7820 \(\dagger\)  
\(\wedge\)  0.7  0.0  0.3927 \(\dagger\)  0.6468  0.7900  
Web’02  S  –  1.0  –  0.1877  0.4617  0.2380  
Nonelite  T  \(\vee\)  0.8  0.0  0.1984 \(\dagger\)  0.4767 \(\dagger\)  0.2580  
\(\wedge\)  0.5  0.1  0.2039 \(\dagger\)  0.4844 \(\dagger\)  0.2600  
Elite  T  \(\vee\)  0.9  0.3  0.2002 \(\dagger\)  0.4785 \(\dagger\)  0.2620  
\(\wedge\)  0.5  0.0  0.2037 \(\dagger\)  0.4836 \(\dagger\)  0.2660 
Table 8 Comparison of the scores obtained with the TF-\(\text {IDF}_\text {L}\) model candidates using the nonelite and elite pivotization
Challenge  P  K  C  b  a  \(\text {AP}\)  \(\text {NDCG}\)  \(\text {P@10}\) 

HARD’05  S  –  –  –  0.0721  0.2936  0.1920  
Nonelite  T  \(\vee\)  1.0  1.0  0.0967 \(\dagger\)  0.3329 \(\dagger\)  0.2120  
\(\wedge\)  1.0  1.0  0.0967 \(\dagger\)  0.3329 \(\dagger\)  0.2120  
Elite  T  \(\vee\)  1.0  1.0  0.0753 \(\dagger\)  0.2994 \(\dagger\)  0.1960  
\(\wedge\)  1.0  1.0  0.0753 \(\dagger\)  0.2994 \(\dagger\)  0.1960  
Ad Hoc 8  S  –  –  –  0.0635  0.2762  0.1360  
Nonelite  T  \(\vee\)  1.0  1.0  0.1500 \(\dagger\)  0.4135 \(\dagger\)  0.2440 \(\dagger\)  
\(\wedge\)  1.0  1.0  0.1500 \(\dagger\)  0.4135 \(\dagger\)  0.2440 \(\dagger\)  
Elite  T  \(\vee\)  1.0  1.0  0.0688 \(\dagger\)  0.2914 \(\dagger\)  0.1480 \(\dagger\)  
\(\wedge\)  1.0  1.0  0.0688 \(\dagger\)  0.2914 \(\dagger\)  0.1480 \(\dagger\)  
eHealth’14  S  –  –  –  0.1166  0.3361  0.2640  
Nonelite  T  \(\vee\)  1.0  1.0  0.1623 \(\dagger\)  0.4177 \(\dagger\)  0.3220  
\(\wedge\)  1.0  1.0  0.1623 \(\dagger\)  0.4177 \(\dagger\)  0.3220  
Elite  T  \(\vee\)  1.0  1.0  0.1231 \(\dagger\)  0.3502 \(\dagger\)  0.2780  
\(\wedge\)  1.0  1.0  0.1231 \(\dagger\)  0.3502 \(\dagger\)  0.2780  
Web’02  S  –  –  –  0.0171  0.1387  0.0260  
Nonelite  T  \(\vee\)  1.0  1.0  0.0249 \(\dagger\)  0.1865 \(\dagger\)  0.0460 \(\dagger\)  
\(\wedge\)  1.0  1.0  0.0249 \(\dagger\)  0.1865 \(\dagger\)  0.0460 \(\dagger\)  
Elite  T  \(\vee\)  1.0  1.0  0.0183 \(\dagger\)  0.1456 \(\dagger\)  0.0280  
\(\wedge\)  1.0  1.0  0.0183 \(\dagger\)  0.1456 \(\dagger\)  0.0280 
P  Q  C  k1  b  a  HARD’05  Ad Hoc 8  eHealth’14  Web’02 

Nonelite  \(\text {TF}_{\text {total}}\)  –  \(>0\)  \(*\)  –  0.0873  0.0927  0.2594  0.0543 
\(\vee\)  \(>0\)  \(*\)  \(*\)  0.0873  0.0927  0.2594  0.0543  
\(\wedge\)  \(>0\)  \(*\)  \(*\)  0.0942  0.1058  0.2699  0.0523  
\(\text {TF}_{\text {log}}\)  –  \(*\)  \(*\)  –  0.2005  0.2436  0.4136  0.1911  
\(\vee\)  \(*\)  \(*\)  \(*\)  0.2293  0.2591  0.6081  0.2058  
\(\wedge\)  \(*\)  \(*\)  \(*\)  0.2257  0.2679  0.5985  0.2048  
\(\text {TF}_{\text {BM25}}\)  \(\vee\)  1.2  0.7  \(*\)  0.2228  0.2718  0.5679  0.2033  
–  \(*\)  \(*\)  –  0.1983  0.2597  0.3987  0.1937  
\(\vee\)  \(*\)  \(*\)  \(*\)  0.2316  0.2671  0.6050  0.2042  
\(\wedge\)  \(*\)  \(*\)  \(*\)  0.2006  0.2634  0.3990  0.1892  
\(\text {TF}_{\text {constant}}\)  –  \(>0\)  \(*\)  –  0.0735  0.1868  0.0727  0.0309  
\(\vee\)  \(>0\)  \(*\)  \(*\)  0.1215  0.2087  0.2647  0.0559  
\(\wedge\)  \(>0\)  \(*\)  \(*\)  0.0740  0.1881  0.0735  0.0291  
Elite  \(\text {TF}_{\text {total}}\)  –  \(>0\)  \(*\)  –  0.0873  0.0927  0.2594  0.0543 
\(\vee\)  \(>0\)  \(*\)  \(*\)  0.1495  0.1206  0.5188  0.0965  
\(\wedge\)  \(>0\)  \(*\)  \(*\)  0.0942  0.1058  0.2699  0.0523  
\(\text {TF}_{\text {log}}\)  –  \(*\)  \(*\)  –  0.2005  0.2436  0.4136  0.1911  
\(\vee\)  \(*\)  \(*\)  \(*\)  0.2268  0.2591  0.6070  0.2060  
\(\wedge\)  \(*\)  \(*\)  \(*\)  0.2265  0.2593  0.6131  0.2062  
\(\text {TF}_{\text {BM25}}\)  \(\vee\)  1.2  0.7  \(*\)  0.2301  0.2573  0.5631  0.2033  
–  \(*\)  \(*\)  –  0.1983  0.2597  0.3987  0.1937  
\(\vee\)  \(*\)  \(*\)  \(*\)  0.2339  0.2718  0.6028  0.2023  
\(\wedge\)  \(*\)  \(*\)  \(*\)  0.2010  0.2636  0.4089  0.1926  
\(\text {TF}_{\text {constant}}\)  –  \(>0\)  \(*\)  –  0.0735  0.1868  0.0727  0.0309  
\(\vee\)  \(>0\)  \(*\)  \(*\)  0.1198  0.2075  0.2645  0.0553  
\(\wedge\)  \(>0\)  \(*\)  \(*\)  0.0740  0.1881  0.0735  0.0291 
Challenge  P  C  DLM  TF\(\text {IDF}_{\text {L}}\) 

HARD’05  Nonelite  \(\vee\)  0.2288  0.1523 
\(\wedge\)  0.1998  0.0967  
Elite  \(\vee\)  0.2258  0.1369  
\(\wedge\)  0.1912  0.0753  
Ad Hoc 8  Nonelite  \(\vee\)  0.2679  0.1600 
\(\wedge\)  0.2539  0.1500  
Elite  \(\vee\)  0.2653  0.0821  
\(\wedge\)  0.2556  0.0688  
eHealth’14  Nonelite  \(\vee\)  0.5740  0.4545 
\(\wedge\)  0.4060  0.1623  
Elite  \(\vee\)  0.5769  0.4116  
\(\wedge\)  0.3927  0.1231  
Web’02  Nonelite  \(\vee\)  0.2051  0.0450 
\(\wedge\)  0.2011  0.0250  
Elite  \(\vee\)  0.2092  0.0393  
\(\wedge\)  0.2010  0.0183 
For each test collection (HARD 2005 in Table 3, Ad Hoc 8 in Table 4, eHealth 2014 in Table 5, and Web 2002 in Table 6), we present the results obtained with the TF-IDF model variants and the two pivotizations. In these tables each model appears either in its standard configuration (S) or in its trained configuration (T), the latter being the configuration that maximizes the evaluation measure \(\text {AP}\). The standard parameters of the normalizations for the TF quantifications \(\text {TF}_{\text {total}}\), \(\text {TF}_{\text {log}}\) and \(\text {TF}_{\text {constant}}\) have the effect of disabling the normalization component (\(b=0\)). For \(\text {TF}_{\text {BM25}}\), however, this does not happen, so we can study the effect of the parameter a in its standard parametrization: we keep the standard \(k_1\) and b and report the value of a that maximizes \(\text {AP}\). For the trained parameter values, for all the \({\text {TF}}\) quantifications, the first row shows the best result obtained by maximizing \(\text {AP}\) without verboseness in the scoring function (\(a=0\)), and the following rows show the results obtained when verboseness is added to the scoring function. The tables distinguish between the conjunctive (\(\wedge\)) and disjunctive (\(\vee\)) combinations of document verboseness and length.
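The selection of the trained configuration (T) amounts to an argmax of \(\text {AP}\) over a parameter grid. The following Python sketch illustrates this under stated assumptions: `evaluate_ap` stands in for a full retrieval run scored with AP, and `toy_ap`, the grid values, and all function names are hypothetical placeholders rather than the setup used in the experiments.

```python
from itertools import product


def make_grid(k1_values, b_values, a_values):
    """Enumerate every (k1, b, a) combination as a dict of parameters."""
    return [
        {"k1": k1, "b": b, "a": a}
        for k1, b, a in product(k1_values, b_values, a_values)
    ]


def select_trained(grid, evaluate_ap):
    """Return the parameters with the highest AP and that AP value."""
    best_params, best_ap = None, float("-inf")
    for params in grid:
        ap = evaluate_ap(**params)
        if ap > best_ap:  # strict comparison keeps the first of any ties
            best_params, best_ap = params, ap
    return best_params, best_ap


def toy_ap(k1, b, a):
    """Hypothetical stand-in for running retrieval and measuring AP."""
    return 0.4 - 0.01 * abs(k1 - 4.5) - 0.1 * abs(b - 0.6) - 0.05 * abs(a - 0.2)


# Standard BM25 parameters (k1 = 1.2, b = 0.7) with only a varied,
# mirroring the standard-parametrization row described in the text.
std_grid = make_grid([1.2], [0.7], [i / 10 for i in range(11)])
std_best, std_ap = select_trained(std_grid, toy_ap)

# Fully trained configuration: all three parameters vary.
full_grid = make_grid([1.2, 4.5], [0.6, 0.7], [0.0, 0.2, 0.4])
full_best, full_ap = select_trained(full_grid, toy_ap)
```

Restricting the grid to a single value of a reproduces the "without verboseness" row, while the full grid gives the rows in which verboseness is added.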
\(\text {TF}_{\text {BM25}}\) generally works better than the other \({\text {TF}}\) quantifications, but not for all test collections: for the eHealth 2014 test collection, \(\text {TF}_{\text {log}}\) is better.
We also observe that the best configuration is achieved using the elite pivotization. The conjunctive combination generally works better than the disjunctive one (in 24 of 32 experiments it is better than the disjunctive; all 7 unfavorable cases occur with the Web 2002 test collection).
In Table 7, we present the results obtained for every test collection using DLM with \(\lambda _{d}\) extended with verboseness. For this model the standard parameters are \(b=1\) and \(a=0\), which reduce the formula to the standard DLM without verboseness (Zhai and Lafferty 2001). This variant is shown in the first row for every test collection. The subsequent rows present the variant of \(\lambda _{d}\) combined with verboseness in disjunction and conjunction, with nonelite and elite pivots. For this model we observe that the presence of verboseness produces significant improvements for only one test collection. Overall we observe that the nonelite pivotization should be preferred (all the experiments produce better results than with the elite one). No difference is observed between the disjunctive and conjunctive combinations of the pivots.
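For reference, the standard Dirichlet-smoothed estimate of Zhai and Lafferty (2001) can be written as a mixture of the document model and the collection model. The sketch below checks that equivalence numerically. It assumes \(\lambda_d = |d|/(|d|+\mu)\) as the weight on the document model, which may differ from the convention used for \(\lambda _{d}\) in this paper, and it does not include the verboseness extension. All function names and numeric values are illustrative.

```python
def dirichlet_prob(tf, doc_len, p_collection, mu):
    """Dirichlet-smoothed P(t|d) = (tf + mu * P(t|c)) / (|d| + mu)."""
    return (tf + mu * p_collection) / (doc_len + mu)


def mixture_prob(tf, doc_len, p_collection, mu):
    """The same estimate written as a mixture with a document-dependent weight."""
    # Assumed convention: lambda_d is the weight on the document model.
    lambda_d = doc_len / (doc_len + mu)
    p_ml = tf / doc_len if doc_len > 0 else 0.0  # maximum-likelihood P(t|d)
    return lambda_d * p_ml + (1.0 - lambda_d) * p_collection


# Illustrative values only: a term occurring 3 times in a 100-token document.
direct = dirichlet_prob(tf=3, doc_len=100, p_collection=0.01, mu=2000)
mixed = mixture_prob(tf=3, doc_len=100, p_collection=0.01, mu=2000)
```

Because the mixture weight depends only on the document length, any extension of the smoothing with further document statistics acts through this weight.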
In Table 8, we present the results obtained for every test collection using the TF-\(\text {IDF}_\text {L}\) model with \(\lambda _{q}\), which combines the term length and burstiness in an LM fashion. For this model the standard parameter is \(\lambda _q = 1\), which reduces this IR model to a non-TF-normalized \(\text {TF}_{\text {total}}\)-IDF model. This variant is shown in the first row for every test collection. The following rows present the variant of \(\lambda _{q}\) combined in disjunction and conjunction with nonelite and elite pivots. We observe that this parametrization produces significantly better results than the standard case, and that the nonelite parametrization should be preferred. Here too, as for DLM, no difference is observed between the disjunctive and conjunctive combinations of the pivots. We also observe that the value of the trained parameter a is often equal to 1, which suggests that, for these model variants, the term length does not play an important role in adjusting the document’s score. This is a curious behavior since it is dual to the DLM model, where the document verboseness does not play an important role either.
6 Analysis and discussion
Finally we make some observations across the experimental results about the behavior of the parameter a. Before that however, let us make an observation on the nature of the data at our disposal. Figure 3 shows the distribution of the document verboseness versus document length for the elite and nonelite pivotizations. In both cases we see that verboseness brings additional information compared to document length: the plotted distributions are well spread, away from the first diagonal.
Comparing the two distributions, it is interesting to observe that the nonelite pivotization is significantly more skewed than the elite one: the x-axis of the left plot has a scale in the (0, 0.02) range, while the one on the right plot has a scale that matches the y-scale: (0, 4). This supports and grounds our hypothesis that elite pivotization should provide us with better means to balance verboseness and document length with parameter a.
The a parameter controls the contribution of elite pivoted verboseness and elite pivoted document length. When \(a>0.5\), the contribution of the document verboseness is higher than the contribution of the document length, and vice versa when \(a<0.5\). Looking at the distribution for the elite pivotizations of the documents, and redefining the origin to the point (1, 1), we split the distributions into four quadrants. Whatever a we fix, the documents in quadrant I will always be demoted to some degree, and the documents in quadrant III will always be promoted to some degree. The question is therefore what happens to the documents in quadrants II and IV. When the contribution of document verboseness is preferred (\(a>0.5\)), more documents with low verboseness (\(\hat{v}_d<1\)) and high length (\(\hat{l}_d>1\)) will be promoted against the documents of quadrant IV, and when the contribution of the document length is preferred (\(a<0.5\)) the contrary happens. Therefore, the a values should anticorrelate with the ratio of the number of relevant documents between quadrant II and quadrant IV. The two lists of values, sorted by test collection, are as follows: the values of a extracted from Tables 3, 4, 5, and 6 for the standard BM25 case with trained a are 0.8, 0.6, 0.4, and 0.0, and the corresponding ratios are 0.63, 0.86, 1.16, and 4.20; we observe that they anticorrelate. Therefore, if we assume that all the documents of the collection should be relevant, we should find the a value that best balances the proportion of non-verbose but long documents with that of short but verbose documents.
All the test collections but Disks 4&5 have been crawled from the Web. For all of them the plots show visible noise. In particular, we observe black dots that are most probably caused by the existence of duplicated documents in the collections. For example, the existence of duplicated documents in the eHealth’14 test collection is a known issue in the eHealth IR community.
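The quadrant analysis can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the procedure used to produce the reported ratios: the x-axis is taken to be the pivoted verboseness \(\hat{v}_d\) and the y-axis the pivoted length \(\hat{l}_d\), as the description of quadrant II suggests; documents lying exactly on an axis are assigned arbitrarily; and Pearson's coefficient is only one possible way to quantify the anticorrelation, which the text does not specify. The toy documents are hypothetical, while `a_values` and `ratios` are the two lists quoted in the text.

```python
import math


def quadrant(v_hat, l_hat):
    """Quadrant of a document around the origin (1, 1)."""
    if v_hat >= 1 and l_hat >= 1:
        return "I"    # verbose and long
    if v_hat < 1 and l_hat >= 1:
        return "II"   # not verbose but long
    if v_hat < 1 and l_hat < 1:
        return "III"  # not verbose and short
    return "IV"       # verbose but short


def relevant_ratio(docs, relevant):
    """Ratio of relevant documents in quadrant II to those in quadrant IV.

    docs maps a document id to its (v_hat, l_hat) pair.
    """
    counts = {"I": 0, "II": 0, "III": 0, "IV": 0}
    for doc_id, (v_hat, l_hat) in docs.items():
        if doc_id in relevant:
            counts[quadrant(v_hat, l_hat)] += 1
    return counts["II"] / counts["IV"] if counts["IV"] else math.inf


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical toy collection: id -> (pivoted verboseness, pivoted length).
toy_docs = {"d1": (0.5, 2.0), "d2": (2.0, 0.5), "d3": (0.4, 1.5), "d4": (1.5, 1.5)}
toy_ratio = relevant_ratio(toy_docs, relevant={"d1", "d2", "d3"})

# Values quoted in the text, sorted by test collection.
a_values = [0.8, 0.6, 0.4, 0.0]
ratios = [0.63, 0.86, 1.16, 4.20]
r = pearson(a_values, ratios)  # negative when the two lists anticorrelate
```

With only four collections the coefficient is indicative at best, so it should be read as a summary of the trend rather than as a statistical test.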
In Tables 3, 4, 5, and 6 we observe that the best performing configuration, for both \(\text {TF}_{\text {log}}\) and \(\text {TF}_{\text {total}}\), uses the trained parameters combined in disjunction. In particular, in Table 4 these configurations also show statistical significance against both the standard configuration and the trained configuration without verboseness (\(a=0\)). The elite pivotization generally performs better than the nonelite pivotization; in particular, the best performing configurations use the elite pivotization with trained parameters combined in conjunction. We also observe that, with the elite pivotization, the weighting role is generally taken by the parameter a (\(b=1\) means that a full document verboseness and length normalization is applied).
In Fig. 4 we further analyze the best configuration on a per-topic basis. Here, we show the difference between the \(\text {AP}\) of the trained TF\(_\text {BM25}\)-IDF with verboseness combined in conjunction with elite pivots and the \(\text {AP}\) of the trained classic TF\(_\text {BM25}\)-IDF. If the difference is positive, the variant with verboseness is better than the classic version.
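A per-topic comparison of this kind can be computed as sketched below. The sketch assumes binary relevance judgments and rankings given as ordered lists of document identifiers; the function names and the toy topic are hypothetical and are not taken from the experimental setup.

```python
def average_precision(ranking, relevant):
    """Average precision of one ranked list against a set of relevant ids."""
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank  # precision at the rank of each relevant hit
    return total / len(relevant) if relevant else 0.0


def per_topic_delta(runs_new, runs_base, qrels):
    """AP(new) - AP(base) for every topic; positive means the new run wins."""
    return {
        topic: average_precision(runs_new[topic], qrels[topic])
        - average_precision(runs_base[topic], qrels[topic])
        for topic in qrels
    }


# Hypothetical single-topic example.
qrels = {"q1": {"d1", "d3"}}
runs_with_verboseness = {"q1": ["d1", "d3", "d2"]}
runs_classic = {"q1": ["d2", "d1", "d3"]}
delta = per_topic_delta(runs_with_verboseness, runs_classic, qrels)
```

Sorting the resulting differences by value gives the kind of per-topic view in which positive values favor the variant with verboseness.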
7 Conclusion
This paper presents an extensive study of \({\text {TF}}\) quantifications and normalizations. The quantifications are with respect to a well-defined spectrum comprising \(\text {TF}_{\text {total}}\), \(\text {TF}_{\text {log}}\), \(\text {TF}_{\text {BM25}}\), and \(\text {TF}_{\text {constant}}\). Each of these \({\text {TF}}\) quantifications reflects a dependence assumption. In particular, \(\text {TF}_{\text {total}}\) and \(\text {TF}_{\text {constant}}\) are the extremes of the quantification spectrum, assuming independence for the former and subsumption for the latter. \(\text {TF}_{\text {BM25}}\) is a relatively strong dependence assumption, and \(\text {TF}_{\text {log}}\) is in the middle between \(\text {TF}_{\text {total}}\) and \(\text {TF}_{\text {BM25}}\). Each of these quantifications incorporates a \({\text {TF}}\) normalization parameter, usually denoted as \(K_d\).
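The ordering of the spectrum can be illustrated by comparing how much each quantification gains from one additional occurrence of a term. The functional forms in the following Python sketch are assumptions: they are common textbook shapes for the four quantifications, chosen only to illustrate the ordering, and they may differ from the exact definitions (in particular the placement of \(K_d\)) given earlier in the paper.

```python
import math


def tf_total(tf, K_d=1.0):
    """Assumed form: linear in tf (occurrences treated as independent)."""
    return tf / K_d


def tf_log(tf, K_d=1.0):
    """Assumed form: logarithmic dampening of the normalized frequency."""
    return math.log(1.0 + tf / K_d)


def tf_bm25(tf, K_d=1.0):
    """Assumed form: BM25-style saturation tf / (tf + K_d)."""
    return tf / (tf + K_d)


def tf_constant(tf, K_d=1.0):
    """Assumed form: presence only; extra occurrences add nothing."""
    return (1.0 if tf > 0 else 0.0) / K_d


def marginal_gain(quantification, tf, K_d=1.0):
    """Gain obtained from the (tf + 1)-th occurrence of a term."""
    return quantification(tf + 1, K_d) - quantification(tf, K_d)


# Gain of a second occurrence under each quantification, with K_d = 1.
gains = {
    f.__name__: marginal_gain(f, 1)
    for f in (tf_total, tf_log, tf_bm25, tf_constant)
}
```

Under these assumed forms the gain decreases from `tf_total` through `tf_log` and `tf_bm25` to `tf_constant`, which matches the ordering of the spectrum described above.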
Whereas current approaches regarding \(K_d\) consider only the document length as parameter of \(K_d\), this paper makes the case for \(K_d\) to be a combination of document verboseness and length. There are many heuristic options for how to combine the parameters, and this paper contributes the theoretical foundations leading to a systematic combination of document verboseness and length.
The paper reports results of an experimental study investigating the effect of various settings of \(K_d\) for the four main \({\text {TF}}\) quantifications. The overall finding is that combining document verboseness with document length (either in a conjunctive or disjunctive way) improves retrieval quality when compared to results considering document length only.
We expand this in two directions: first by exploring a similar normalization in the context of LM, and second by exploring a similar normalization in the context of TF-IDF. For the former, we include document verboseness in the Dirichlet smoothing, where no significant effect is observed, which indicates that document verboseness can be neglected. For the latter, in Sect. 4.3 we have observed the duality between document verboseness and document length on one side, and term burstiness and term length on the other side, and we observed the effect of these normalizations on the query side with respect to LM. Here, significant improvements are observed; however, these improvements are obtained primarily through term burstiness, while the term length can be neglected. In both directions the new parametrizations yield improvements, and their results show a dual behavior, given by the exclusion of document verboseness in the former and by the exclusion of term length in the latter.
In summary, in this paper we have provided an exhaustive study of normalization factors in probabilistic IR models using four different test collections. Based on the observations made on these test collections, we have made the case that different domains, having different text statistics, can be directly factored into the existing probabilistic models. We have thus provided a quantification of the various document and term statistics into one factor that balances different prior probabilities that all these models, more or less explicitly, rely on.
Acknowledgements
Open access funding provided by Austrian Science Fund (FWF). This research was partly supported by the Austrian Science Fund (FWF) Project Number P25905-N23 (ADmIRE). This work has been supported by the SelfOptimizer project (FFG 852624) in the EUROSTARS programme, funded by EUREKA, the BMWFW and the European Union.
References
Amati, G., & Kerpedjiev, S. (1992). An information retrieval logic model: Implementation and experiments. Tech. Rep. REL 5b04892, Fondazione Ugo Bordoni, Rome, Italy.
Amati, G., & Van Rijsbergen, C. J. (2002). Probabilistic models of information retrieval based on measuring the divergence from randomness. ACM Transactions on Information Systems, 20(4), 357–389. https://doi.org/10.1145/582415.582416.
Church, K., & Gale, W. (1999). Inverse document frequency (IDF): A measure of deviations from Poisson (pp. 283–295). Dordrecht: Springer. https://doi.org/10.1007/978-94-017-2390-9_18.
Fang, H., Tao, T., & Zhai, C. (2004). A formal study of information retrieval heuristics. In Proceedings of the 27th annual international ACM SIGIR conference on research and development in information retrieval, SIGIR ’04 (pp. 49–56). New York, NY, USA: ACM. https://doi.org/10.1145/1008992.1009004.
Hanbury, A., & Lupu, M. (2013). Toward a model of domain-specific search. In Proceedings of the 10th conference on open research areas in information retrieval, OAIR ’13 (pp. 33–36). Paris, France: CID.
He, B., & Ounis, I. (2003). A study of parameter tuning for term frequency normalization. In Proceedings of the twelfth international conference on information and knowledge management, CIKM ’03 (pp. 10–16). New York, NY, USA: ACM. https://doi.org/10.1145/956863.956867.
He, B., & Ounis, I. (2005a). A study of the Dirichlet priors for term frequency normalisation. In Proceedings of the 28th annual international ACM SIGIR conference on research and development in information retrieval, SIGIR ’05 (pp. 465–471). New York, NY, USA: ACM. https://doi.org/10.1145/1076034.1076114.
He, B., & Ounis, I. (2005b). Term frequency normalisation tuning for BM25 and DFR models (pp. 200–214). Heidelberg, Berlin: Springer. https://doi.org/10.1007/978-3-540-31865-1_15.
Knaus, D., Mittendorf, E., & Schäuble, P. (1994). Improving a basic retrieval method by links and passage level evidence. In Proceedings of the 3rd text REtrieval conference (pp. 241–241).
Lipani, A., Lupu, M., Hanbury, A., & Aizawa, A. (2015). Verboseness fission for BM25 document length normalization. In Proceedings of the 2015 international conference on the theory of information retrieval, ICTIR ’15 (pp. 385–388). New York, NY, USA: ACM. https://doi.org/10.1145/2808194.2809486.
Lv, Y., & Zhai, C. (2011a). Adaptive term frequency normalization for BM25. In Proceedings of the 20th ACM international conference on information and knowledge management, CIKM ’11 (pp. 1985–1988). New York, NY, USA: ACM. https://doi.org/10.1145/2063576.2063871.
Lv, Y., & Zhai, C. (2011b). Lower-bounding term frequency normalization. In Proceedings of the 20th ACM international conference on information and knowledge management, CIKM ’11 (pp. 7–16). New York, NY, USA: ACM. https://doi.org/10.1145/2063576.2063584.
Lv, Y., & Zhai, C. (2011c). When documents are very long, BM25 fails! In Proceedings of the 34th international ACM SIGIR conference on research and development in information retrieval, SIGIR ’11 (pp. 1103–1104). New York, NY, USA: ACM. https://doi.org/10.1145/2009916.2010070.
Metzler, D. (2008). Generalized inverse document frequency. In Proceedings of the 17th ACM conference on information and knowledge management, CIKM ’08 (pp. 399–408). New York, NY, USA: ACM. https://doi.org/10.1145/1458082.1458137.
Na, S. H., Kang, I. S., & Lee, J. H. (2008). Improving term frequency normalization for multi-topical documents and application to language modeling approaches (pp. 382–393). Berlin, Heidelberg: Springer. https://doi.org/10.1007/978-3-540-78646-7_35.
Robertson, S. E., & Walker, S. (1999). Okapi/Keenbow at TREC-8. In Proceedings of the 8th text REtrieval conference (Vol. 8, pp. 151–162).
Robertson, S. E., Walker, S., Jones, S., Hancock-Beaulieu, M., & Gatford, M. (1994). Okapi at TREC-3. In Proceedings of the 3rd text REtrieval conference (Vol. 3, pp. 109–126).
Robertson, S., & Zaragoza, H. (2009). The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval, 3(4), 333–389. https://doi.org/10.1561/1500000019.
Roelleke, T. (2013). Information retrieval models: Foundations and relationships. https://doi.org/10.2200/S00494ED1V01Y201304ICR027.
Roelleke, T., Kaltenbrunner, A., & Baeza-Yates, R. (2015). Harmony assumptions in information retrieval and social networks. The Computer Journal, 58(11), 2982. https://doi.org/10.1093/comjnl/bxv031.
Roelleke, T., & Wang, J. (2008). TF-IDF uncovered: A study of theories and probabilities. In Proceedings of the 31st annual international ACM SIGIR conference on research and development in information retrieval, SIGIR ’08 (pp. 435–442). New York, NY, USA: ACM. https://doi.org/10.1145/1390334.1390409.
Rousseau, F., & Vazirgiannis, M. (2013). Composition of TF normalizations: New insights on scoring functions for ad hoc IR. In Proceedings of the 36th international ACM SIGIR conference on research and development in information retrieval, SIGIR ’13 (pp. 917–920). New York, NY, USA: ACM. https://doi.org/10.1145/2484028.2484121.
Salton, G., & Buckley, C. (1988). Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5), 513–523. https://doi.org/10.1016/0306-4573(88)90021-0.
Singhal, A., Buckley, C., & Mitra, M. (1996). Pivoted document length normalization. In Proceedings of the 19th annual international ACM SIGIR conference on research and development in information retrieval, SIGIR ’96 (pp. 21–29). New York, NY, USA: ACM. https://doi.org/10.1145/243199.243206.
Zhai, C., & Lafferty, J. (2001). A study of smoothing methods for language models applied to ad hoc information retrieval. In Proceedings of the 24th annual international ACM SIGIR conference on research and development in information retrieval, SIGIR ’01 (pp. 334–342). New York, NY, USA: ACM. https://doi.org/10.1145/383952.384019.
Copyright information
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.