A systematic approach to normalization in probabilistic models

Every information retrieval (IR) model embeds in its scoring function a form of term frequency (TF) quantification. The contribution of the term frequency is determined by the properties of the function of the chosen TF quantification, and by its TF normalization. The first defines how independent the occurrences of multiple terms are, while the second acts on mitigating the a priori probability of having a high term frequency in a document (estimation usually based on the document length). New test collections, coming from different domains (e.g. medical, legal), give evidence that not only document length, but in addition, verboseness of documents should be explicitly considered. Therefore we propose and investigate a systematic combination of document verboseness and length. To theoretically justify the combination, we show the duality between document verboseness and length. In addition, we investigate the duality between verboseness and other components of IR models. We test these new TF normalizations on four suitable test collections. We do this on a well defined spectrum of TF quantifications. Finally, based on the theoretical and experimental observations, we show how the two components of this new normalization, document verboseness and length, interact with each other. Our experiments demonstrate that the new models never underperform existing models, while sometimes introducing statistically significantly better results, at no additional computational cost.


Introduction
The development of retrieval models is one of the key aspects of research in information retrieval (IR). The IR models arise from experimental observations about the use of the language, predominantly on collections of documents primarily composed of news corpora. Today, with the almost total digitization of most text produced, it is clear that the textual documents are not just news and that different collections require different approaches (Hanbury and Lupu 2013). Consequently, the field has been driven to deal with different kinds of information types, demonstrated by the creation of new and more domain specific initiatives in the main IR evaluation campaigns: TREC, NTCIR, CLEF, and FIRE. Now, thanks to the observations made in the context of these evaluation campaigns, we are able to revisit some of the original assumptions and extend the models to integrate other collection statistics that reflect the different use of the language in different domains.
Every IR model boils down to a scoring function in which we can distinguish a component that increases with the number of occurrences of a term in a document (a term frequency component, TF) and a component that decreases with the commonality of a term (an inverse document frequency component, IDF). In this paper we focus on the TF component. Its normalization, first introduced by Robertson et al. (1994) for BM25, and then generalized by Singhal et al. (1996) for a generic model, consists in adjusting the within-document term frequency ($tf_d$) based on the ratio between the document length ($l_d$) and its expectation ($E[l_d]$), called pivoted document length normalization. The work of Singhal et al. is motivated by the experimental observation that the length pattern of the retrieved documents should match the pattern of the relevant documents. Robertson et al. justify this normalization, later declared as 'soft' for the mitigation effect provided by the division by the mean, by introducing two contrasting hypotheses (Robertson and Zaragoza 2009), named verboseness and multi-topicality: (a) the verboseness hypothesis states that some authors need more words to explain something that could have been explained with fewer; (b) the multi-topicality hypothesis states that the reason why more words are required is because the author has covered more ground. While the first hypothesis suggests a document should be normalized by its length, the second suggests the contrary.
Recently, Lipani et al. (2015) have brought this discussion back to the attention of the IR community, pointing out that another collection statistic could be embedded in the TF normalization of BM25. This new statistic measures a kind of verboseness, the repetitiveness of terms in a document, and leads to better performance than the standard BM25.
In this paper we address this new observation from the perspective of the established models, and provide a new, general theory. Before doing that, a few general observations are in order.
Retrieval models combine various parameters into a score reflecting the degree to which a document implies a query. The common parameters and rationales are:
- $tf_d$: within-document term frequency; frequent is good
- $P_D(t|c)$: document-based term probability (aka $IDF(t, c) = -\log P_D(t|c)$); rare is good
- $P(t|c)$: occurrence-based term probability (LM mixture)
- $l_d$: document length; to promote short documents
where c is a collection of documents, d is a document, and t is a term. We claim that there are other properties of documents and terms that are important but under-represented, namely verboseness and the previously introduced burstiness (Roelleke 2013). In this paper we will focus primarily on verboseness, but we will also make some observations on burstiness and its relation with IDF. However, before starting, we introduce the notation used.

Notation
The basic symbols and sets are given in the following table. The notation is based on the proposal made by Roelleke (2013). However, unlike Roelleke, given that here we will not theoretically analyze different collections, we will generally drop the collection c index where convenient and not ambiguous. Based on the basic symbols, we define frequencies. Term frequencies, document frequencies, average term frequencies are ambiguous notions. It is important to clarify exactly what symbols mean.
- $l_t$: number of occurrences of the term t in the collection, here also called term length (aka collection frequency)
- $\mathcal{D}_t$: set of documents where t occurs
- $\mathcal{T}_d$: set of terms in d
- $|\mathcal{D}_t|$: number of documents where t occurs (aka document frequency, df(t))
- $|\mathcal{T}_d|$: number of distinct terms in d
- $l_d$: length of document d (number of term occurrences; note $l_d \ge |\mathcal{T}_d|$)
Next, we define the four averages important for this paper. The first two combine in a systematic way the symbols of the previous table.
- $l_t / |\mathcal{D}_t|$: average frequency of term t in the documents in which the term occurs
- $l_d / |\mathcal{T}_d|$: average term frequency of terms that occur in document d
Note that there are two notions regarding "average term frequency". In the first case the average is performed fixing t and averaging across the documents $\mathcal{D}_t$ containing t, and in the second case the average is performed fixing d and averaging across the terms $\mathcal{T}_d$ contained therein.
Finally, we introduce the probabilities used in this paper.
As can be seen, in this paper, when mentioning probability (P) with no index we refer to the probability based on locations, i.e. the probability defined on the sample space of term occurrences.

Motivations
In this section we formally introduce the document verboseness and term burstiness. We then motivate their investigation in IR models.
Verboseness is reflected by the ratio $l_d / |\mathcal{T}_d|$: the document length divided by the number of (distinct) terms in the document. The ratio corresponds to the average $tf_d$ (over all terms) in document d:

$$v_d := \frac{l_d}{|\mathcal{T}_d|} \qquad (1)$$

A document is verbose if few terms are repeated many times; the domain of $v_d$ is $[1, l_d]$: 1 for non-verbose (no term occurs more than once), and $l_d$ for maximally verbose (one term is repeated $l_d$ times).
Intuitively, the more verbose (repetitive) a document is, the higher is the chance to find a high tf d . In other words, a document has a high score just because words are repeated (e.g. spamming), and therefore, one wants to demote verbose documents in the ranking.
Burstiness is reflected by the ratio $l_t / |\mathcal{D}_t|$, that is the length of the term in the collection c (or number of occurrences of the term in c) divided by the number of the collection's documents where the term t occurs (aka document frequency). The ratio corresponds to the average $tf_d$ (over the number of documents where the term t occurs) in collection c:

$$b_t := \frac{l_t}{|\mathcal{D}_t|} \qquad (2)$$

A term is bursty if it occurs in few documents many times; the domain of $b_t$ is $[1, l_t]$: 1 for a non-bursty term (it occurs only once in each document where it is present), and $l_t$ for maximally bursty (all the occurrences are only in one document).
Intuitively, the more bursty a term is, the higher is the chance to find a high tf d . In other words, a bursty term occurs in fewer documents than a non-bursty (a normal) term, and therefore, one wants to promote documents containing bursty terms.
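The two ratios can be illustrated on a small collection. The following is a minimal sketch, assuming whitespace-tokenized documents; the toy collection and the function names are hypothetical and serve only to make the definitions of verboseness and burstiness concrete.

```python
from collections import Counter

# Toy collection (hypothetical data, for illustration only).
docs = {
    "d1": "a a a b".split(),
    "d2": "a b c d".split(),
    "d3": "c c c c".split(),
}


def verboseness(tokens):
    """Document length divided by the number of distinct terms in the document."""
    return len(tokens) / len(set(tokens))


def burstiness(term, collection):
    """Occurrences of the term in the collection divided by its document frequency."""
    counts = [Counter(tokens)[term] for tokens in collection.values()]
    term_length = sum(counts)
    doc_frequency = sum(1 for c in counts if c > 0)
    return term_length / doc_frequency


v = {name: verboseness(tokens) for name, tokens in docs.items()}
b = {term: burstiness(term, docs) for term in "abcd"}

print(v)  # d3 repeats a single term, so it is maximally verbose
print(b)  # "c" is concentrated in few documents, so it is the most bursty
```

In this toy collection all documents have the same length, yet their verboseness differs, which is the situation that length normalization alone cannot distinguish.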
Instead of verboseness and burstiness, scoring functions most often use normalization of the tf d based on the document length l d (e.g. in the TF component of BM25 and in some versions of TF-IDF) .
The contribution of the document length is smoothed by its average, which corresponds to the average $l_d$ (over all the documents) in collection c:

$$\bar{l}_d := \frac{1}{|\mathcal{D}|} \sum_{d \in \mathcal{D}} l_d$$

This is then used to calculate the pivoted document length (pivotization is indicated in the paper with a hat) as follows:

$$\hat{l}_d := \frac{l_d}{\bar{l}_d}$$

The $\hat{l}_d$ is greater than 1 for relatively long documents (greater than the average document length), and smaller than 1 for short documents (lower than the average document length).
It is surprising that IR models are keen to capture the ̂l d , but seem to hide away verboseness and burstiness, i.e. there is no parameter explicitly associated with these properties. However we observe that some IR models implicitly use these normalizations.
We investigate which IR models capture verboseness and burstiness, and how the parameters can be made explicit or added. Motivated by the work of Lipani et al. (2015), we formally justify verboseness from its duality with the document length normalization. As a supportive case we also present its duality with the concept of burstiness (Roelleke 2013), and term length (aka collection frequency).

Contributions and structure
The main contributions of this paper are: (1) The inclusion of document verboseness as an explicit parameter in TF quantifications, showing that verboseness is to be viewed in a similar way as the document length in the TF normalizations; (2) An extensive set of experiments capturing a well-defined spectrum of TF quantifications, whose results for log-based and BM25-based TF quantifications deliver a significant contribution to insights into the effect of TF quantifications, even beyond the TF normalization variants; (3) Theoretical justifications for the way document verboseness and length are combined, considering the dualities between verboseness and other parameters (including the burstiness of terms).
The remainder of the paper is structured as follows: in Sect. 2 we present the background. In Sect. 3, the main contribution of the paper, namely combining document verboseness and length into the normalization parameter K d of the TF quantification, is presented. We next review in Sect. 4 the probabilistic foundations of IR models. This highlights the role of parameters such as verboseness, burstiness and document length, and the theoretical justification of TF BM25 -IDF. In Sect. 5, we report the experimental setup and results, followed by Sect. 6 dedicated to the discussion of the results. Section 7 concludes the paper.
The discussion about the TF normalization was initiated by Robertson and Zaragoza (2009), introducing the two hypotheses: verboseness and multi-topicality and then followed by the work of Singhal et al. (1996) where the document length pivotization is justified experimentally. Not much work has been done on the multi-topicality hypothesis, but some for the verboseness hypothesis. However, the problem of how to weight terms dates back further, to the work of Salton and Buckley (1988). Na et al. (2008) introduce the concept of repetitiveness to derive a smoothing method for Language Modeling, showing an improvement with respect to other smoothing methods.
Following other work on the TF normalization issues, He and Ounis (2005a) apply the Dirichlet priors to the TF normalization following the idea of Amati and Van Rijsbergen (2002), and test it on different test collections (He and Ounis 2003, 2005b). Lv and Zhai pointed out that the TF quantification based on document length excessively penalizes very long documents due to its lower bound, a problem mitigated by leveraging the TF normalization by adding a constant (Lv and Zhai 2011b). They also pointed out that in case of BM25 it can be mitigated by adding a constant to the TF normalization (Lv and Zhai 2011c). Rousseau and Vazirgiannis (2013) generalized the previously mentioned TF normalizations through functional composition. Lv and Zhai (2011a) estimate dynamically the parameter $k_1$ of BM25, based on a proposed information gain measure. Lipani et al. (2015) introduce a new variant of BM25, called BM25VA, that explicitly incorporates verboseness. This is the main work that motivates this paper. The verboseness is defined as in Eq. (1) and pivoted by its collection average. Verboseness is then added to $TF_{BM25}$ by linearly combining the two contributions through the parameter b. In this work, it is heuristically shown that the parameter b is inversely proportional to a statistic of the collection, the average collection verboseness $E[v_d]$, and that it can be predicted without statistically damaging the performance of the trained BM25.
Another way of approaching the length normalization issue is to consider retrieval of the individual passages (Robertson and Walker 1999). However, this use of passages to address length normalization is theoretically unjustified and introduces a series of decision points (size and nature of passages) that are not the focus of this current study.

TF normalisations
Before getting into the details of the duality between document verboseness and length, it is necessary to formally define the current pivotization of document length and introduce the pivotization of verboseness. To do this we start from the foundation of every IR model: the document-term matrix $A \in \mathbb{N}^{|\mathcal{D}| \times |\mathcal{T}|}$, in which each element is a $tf_d$, indicated here by $a_{d,t}$ for convenience of notation. For any given matrix, we can define two ways to sum the elements of this matrix: one that fixes a column (a term t) and sums over the rows (the $|\mathcal{D}|$ documents), and one that fixes a row (a document d) and sums over the columns (the $|\mathcal{T}|$ terms). Doing this we calculate two lengths, the length of a term and the length of a document, as follows:

$$l_t := \sum_{d \in \mathcal{D}} a_{d,t}, \qquad l_d := \sum_{t \in \mathcal{T}} a_{d,t}$$

Now, if we want to compute the average of the values on each row or column, we have to divide the sums obtained above by a value. For this value we actually have two options: the number of columns or rows, and the number of non-zero elements in the columns or rows. The first is what we would call the average, and the second the elite average. To give an intuition, think of the question "What is the average number of Ferraris owned by a person?". This question has two answers: we can divide the total number of Ferraris (the sum of the elements on a row/column) by the total number of people on the planet (the number of columns/rows); or, we can consider only those people that have at least one Ferrari and then divide the number of Ferraris by the size of this set of people. The first one is the common average, while the second, obviously, is the elite average.
Returning to our document-term matrix, we will denote by a bar ($\bar{a}$) a common average and by a breve ($\breve{a}$) an elite average:

$$\bar{a}_t := \frac{l_t}{|\mathcal{D}|}, \qquad \breve{a}_t := \frac{l_t}{|\mathcal{D}_t|}, \qquad \bar{a}_d := \frac{l_d}{|\mathcal{T}|}, \qquad \breve{a}_d := \frac{l_d}{|\mathcal{T}_d|}$$

in which we observe that the two elite averages just defined, $\breve{a}_t$ and $\breve{a}_d$, correspond to the burstiness $b_t$ as defined in Eq. (2) and the verboseness $v_d$ as defined in Eq. (1).
Considering the remaining elements, $\bar{a}_t$, $\breve{a}_t$, $\bar{a}_d$ and $\breve{a}_d$, we can think of them as defining an average document $\bar{d} = [\bar{a}_{t_1} \dots \bar{a}_{t_{|\mathcal{T}|}}]$, an elite average document $\breve{d} = [\breve{a}_{t_1} \dots \breve{a}_{t_{|\mathcal{T}|}}]$, an average term $\bar{t} = [\bar{a}_{d_1} \dots \bar{a}_{d_{|\mathcal{D}|}}]$, and an elite average term $\breve{t} = [\breve{a}_{d_1} \dots \breve{a}_{d_{|\mathcal{D}|}}]$. Moreover, we observe also that the elite average document is equal to $\breve{d} = [b_{t_1} \dots b_{t_{|\mathcal{T}|}}]$ and the elite average term is $\breve{t} = [v_{d_1} \dots v_{d_{|\mathcal{D}|}}]$. So, now, for each row d and for each column t we have a sum, an average, and an elite average. To obtain a collection-level statistic, we have to aggregate again, calculating sums and averages (common and elite averages are identical now, because all rows and all columns have a non-zero aggregated value).
Doing so, we observe that

$$\bar{l}_d = \frac{1}{|\mathcal{D}|} \sum_{d \in \mathcal{D}} l_d = \sum_{t \in \mathcal{T}} \bar{a}_t$$

i.e. the average document length $\bar{l}_d$ is equal to the sum of the elements of the average document $\bar{d}$.
However, the same observation is not valid for verboseness, because it is an elite average. Instead, we have two notations:

$$\bar{v}_d := \frac{l_c}{|\mathcal{T}|}, \qquad \breve{v}_d := \frac{1}{|\mathcal{D}|} \sum_{d \in \mathcal{D}} v_d$$

where $l_c$ is the total number of term occurrences in the collection. A graphical representation of the calculations performed in this section is shown in Fig. 1.
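The aggregation steps of this section can be traced on a small matrix. The following is a minimal sketch in plain Python; the matrix values and all variable names are hypothetical. It assumes that the non-elite average verboseness is the total number of occurrences divided by the vocabulary size (the reading that matches the figures of Example 1), and that the elite average verboseness is the mean of the per-document verboseness values.

```python
# Hypothetical 3x4 document-term matrix: rows are documents, columns are terms.
A = [
    [3, 1, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 4, 0],
]
n_docs, n_terms = len(A), len(A[0])


def nonzero(values):
    """Number of non-zero entries, i.e. the size of the 'elite' set."""
    return sum(1 for x in values if x > 0)


columns = [[A[i][j] for i in range(n_docs)] for j in range(n_terms)]

l_d = [sum(row) for row in A]        # document lengths (row sums)
l_t = [sum(col) for col in columns]  # term lengths (column sums)

# Common averages divide by the full dimension, elite averages by the non-zero count.
bar_a_t = [l / n_docs for l in l_t]
breve_a_t = [l / nonzero(col) for l, col in zip(l_t, columns)]  # burstiness per term
bar_a_d = [l / n_terms for l in l_d]
breve_a_d = [l / nonzero(row) for l, row in zip(l_d, A)]        # verboseness per document

# Collection-level statistics.
l_c = sum(l_d)
avg_doc_len = l_c / n_docs
non_elite_avg_v = l_c / n_terms          # assumed reading, see lead-in
elite_avg_v = sum(breve_a_d) / n_docs

# The average document length equals the sum of the elements of the average document.
print(avg_doc_len, sum(bar_a_t))
print(non_elite_avg_v, elite_avg_v)
```

The two verboseness averages differ on this matrix, whereas the average document length is recovered exactly from the average document.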

Duality: document verboseness and length
Recalling the definition of verboseness from Eq. (1), it is the average number of times a document's term occurs within the document. To observe the duality of document verboseness, Eq. (3), let us first define the notation to identify the singleton of a document $d \in \mathcal{D}$ as $\mathcal{D}_d = \{d\}$ and the singleton of a term $t \in \mathcal{T}$ as $\mathcal{T}_t = \{t\}$. Obviously $|\mathcal{D}_d| = |\mathcal{T}_t| = 1$ and therefore we can write $l_d = l_d / |\mathcal{D}_d|$. Let us now consider the pivoted verboseness and pivoted document length, using the two sets of values defined above, $\bar{l}_d = \breve{l}_d$, $\bar{v}_d$ and $\breve{v}_d$:

$$\ddot{l}_d := \frac{l_d}{\bar{l}_d} = \frac{l_d / |\mathcal{D}_d|}{l_c / |\mathcal{D}|}, \qquad \ddot{v}_d := \frac{v_d}{\bar{v}_d} = \frac{l_d / |\mathcal{T}_d|}{l_c / |\mathcal{T}|}, \qquad \hat{l}_d := \frac{l_d}{\breve{l}_d}, \qquad \hat{v}_d := \frac{v_d}{\breve{v}_d}$$

where we indicate the non-elite pivotization with a double dot and the elite pivotization with a hat. The duality is obtained substituting $\mathcal{D} \to \mathcal{T}$. The pivoted verboseness of a document is with respect to the space of terms ($\mathcal{T}$), whereas the pivoted document length of a document is with respect to the space of documents ($\mathcal{D}$). One can also show the duality between document verboseness and length based on probabilistic expressions: $P_L(d) = l_d / l_c$ is the location-based probability of a document. Dividing this by the term-based probability of d, $P_T(d) = |\mathcal{T}_d| / |\mathcal{T}|$, yields the pivoted verboseness. Dividing by the document-based probability of d, $P_D(d) = |\mathcal{D}_d| / |\mathcal{D}| = 1 / |\mathcal{D}|$, yields the pivoted document length.
The dualities between average document verboseness and average document length justify the combination of parameters as formalized in the definition capturing the normalization variants of $K_d$:
- $\ddot{K}_d$: the non-elite normalization comprises the non-elite pivots $\ddot{l}_d$ and $\ddot{v}_d$.
- $\hat{K}_d$: the elite normalization comprises the elite pivots $\hat{l}_d$ and $\hat{v}_d$.
The expression pivdl, pivoted document length, denotes one of the two: $\text{pivdl} \in \{\ddot{l}_d, \hat{l}_d\}$. Analogously for pivdv, pivoted document verboseness.
Then, the pivotization components are defined for the disjunctive (linear) and conjunctive (product) combination of the pivots.
where the two parameters b and a are both defined in [0, 1]. The parameter b controls the degree of normalization between full normalization (when b = 1 ) and no normalization (when b = 0 ), and the parameter a controls the balance between the contributions of pivdl and pivdv . The combination of these pivots becomes part of the usual definition of the normalization parameter K d .
where the parameter k 1 , which is defined in ]0, ∞[ , controls the power of the normalization.
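The behaviour of the two combinations can be sketched in code. The exact formulas are not reproduced in the text at this point, so the forms below are assumptions chosen to satisfy the properties stated here: a linear (disjunctive) and a product (conjunctive) combination governed by a in [0, 1], a parameter b in [0, 1] that moves between no normalization and full normalization, and a factor $k_1$. In particular, which of the two pivots is weighted by a (rather than 1 - a) is an assumption, and all function names are hypothetical.

```python
def piv_disjunctive(pivdl, pivdv, a):
    """Assumed linear combination: a weights the length pivot, (1 - a) the verboseness pivot."""
    return a * pivdl + (1.0 - a) * pivdv


def piv_conjunctive(pivdl, pivdv, a):
    """Assumed product combination with the same weighting convention."""
    return (pivdl ** a) * (pivdv ** (1.0 - a))


def k_d(piv, k1=1.2, b=0.7):
    """Assumed BM25-style normalization: b = 0 means no normalization, b = 1 full normalization."""
    return k1 * (1.0 - b + b * piv)


pivdl, pivdv = 0.75, 0.8  # hypothetical pivot values

for a in (0.0, 0.5, 1.0):
    k_or = k_d(piv_disjunctive(pivdl, pivdv, a))
    k_and = k_d(piv_conjunctive(pivdl, pivdv, a))
    print(a, round(k_or, 4), round(k_and, 4))
```

Under these assumptions the two variants coincide at a = 0 and a = 1 and differ only for intermediate values of a, and with b = 0 the factor reduces to the constant $k_1$.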
It is worth pointing out now that for b = 0, or for b = 1 and $a \in \{0, 1\}$, these two combinations are the same. In particular we should note that: which is the "traditional" $K_d$, created ignoring both document verboseness and length (b = 0).
To summarize, there are four variants of the pivotization factor $K_d$: non-elite disjunctive denoted as $\ddot{K}_\vee$, non-elite conjunctive denoted as $\ddot{K}_\wedge$, and the respective elite variants $\hat{K}_\vee$ and $\hat{K}_\wedge$. The experiments emphasize the analysis of the behavior of these four variants.

Example of calculation of the pivotizations
The next example illustrates the arithmetic to compute the pivoted document verboseness and length.
Example 1 (Pivoted Document Verboseness and Length) Assume a document d with $l_d = 300$ word occurrences, and $|\mathcal{T}_d| = 150$ distinct words. The verboseness is:

$$v_d = \frac{l_d}{|\mathcal{T}_d|} = \frac{300}{150} = 2$$

Let the collection contain $l_c = 10^7$ word occurrences, and $|\mathcal{T}| = 10^5$ distinct words. The non-elite average document verboseness is 100, that is, on average, a term occurs $\bar{v}_d = 100$ times.
The elite average verboseness is the average over the verboseness values of the documents. For example, let $\breve{v}_d = 5/2$ be the elite average verboseness.
The pivoted verboseness is the verboseness divided by the average verboseness, e.g. the non-elite average verboseness:

$$\ddot{v}_d = \frac{v_d}{\bar{v}_d} = \frac{2}{100} = 0.02$$

while the pivoted elite verboseness is the verboseness divided by the elite average verboseness:

$$\hat{v}_d = \frac{v_d}{\breve{v}_d} = \frac{2}{5/2} = 0.8$$

Regarding the document length, let $\bar{l}_d = 400$ be the average document length. Then, the pivoted document length is:

$$\frac{l_d}{\bar{l}_d} = \frac{300}{400} = 0.75$$

Then we can combine the non-elite pivots, for example, in a disjunctive way, or the elite pivots in a conjunctive way. The other two variants, elite pivots combined in a disjunctive way ($\hat{K}_{\vee,d}$), and non-elite pivots combined in a conjunctive way ($\ddot{K}_{\wedge,d}$), are left to the reader.
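The arithmetic of Example 1 can be checked with a few lines of code. This is a minimal sketch that only replays the figures given in the example; the variable names are hypothetical, and the non-elite average verboseness is computed as the collection's word occurrences divided by its number of distinct words, which reproduces the value of 100 stated above.

```python
# Figures from Example 1.
l_d = 300           # word occurrences in the document
n_distinct = 150    # distinct words in the document
l_c = 10**7         # word occurrences in the collection
n_vocab = 10**5     # distinct words in the collection
elite_avg_v = 5 / 2 # elite average verboseness (given)
avg_doc_len = 400   # average document length (given)

v_d = l_d / n_distinct                     # verboseness of the document
non_elite_avg_v = l_c / n_vocab            # non-elite average verboseness
piv_v_non_elite = v_d / non_elite_avg_v    # non-elite pivoted verboseness
piv_v_elite = v_d / elite_avg_v            # elite pivoted verboseness
piv_len = l_d / avg_doc_len                # pivoted document length

print(v_d, non_elite_avg_v, piv_v_non_elite, piv_v_elite, piv_len)
```

The document is shorter than average (pivoted length below 1) and, by either average, also less verbose than the collection as a whole.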

Other dualities
To strengthen the theoretical justifications, we explore two other dualities, namely the duality between document verboseness and term burstiness, and later in the section the duality between term burstiness and term length. Here, the definitions of the first couple:

$$v_d := \frac{l_d}{|\mathcal{T}_d|}, \qquad b_t := \frac{l_t}{|\mathcal{D}_t|}$$

The duality is obtained substituting $\mathcal{T} \to \mathcal{D}$ and $d \to t$ to go from $v_d$ to $b_t$, or $\mathcal{D} \to \mathcal{T}$ and $t \to d$ to go from $b_t$ to $v_d$. Verboseness is the average term frequency when considering the document length $l_d$ over the set $\mathcal{T}_d$ of terms that occur in the respective document. Burstiness is the average term frequency when considering the number of times the term occurs, $l_t$, over the set $\mathcal{D}_t$ of documents in which the respective term occurs.
Furthermore, starting from burstiness and substituting $\mathcal{D} \to \mathcal{T}$, we observe another duality, between term length and burstiness:

$$b_t = \frac{l_t}{|\mathcal{D}_t|}, \qquad l_t = \frac{l_t}{|\mathcal{T}_t|}$$

These dualities, based fundamentally on substitutions between the set of documents $\mathcal{D}$ and the set of terms $\mathcal{T}$, were briefly explored in the early 1990s, when Knaus et al. (1994), and Amati and Kerpedjiev (1992) talked about ITF (inverse term frequency) and IDF. The IDF was later generalized by Metzler (2008).
Whereas the IDF is applied for reasoning about the similarity between documents, the ITF is applied for reasoning about the similarity between terms. Viewing the ITF and IDF together, by looking at the denominator's argument of the logarithms, shows that ITF is related to verboseness, and IDF is related to burstiness.
Overall, the discussion supports the case to consider verboseness as a document-specific parameter, whereas traditional IR focuses on the pivoted document length only.

Summary
This section justified the systematic combination of pivoted document length and pivoted verboseness, while placing them in the context of other dualities, involving burstiness and term length. Table 1 shows the list of all the explored dualities.

Probabilistic derivation of IR models
To discuss the justification of TF quantifications, we consider the probabilistic derivation of IR models. Most IR models can be derived from measuring the dependence between document and query. Let d denote a document, q a query, and c a collection. The document-query independence (DQI; Roelleke and Wang 2008) is the point-wise mutual information expressed as:

$$\mathrm{DQI}(d, q | c) := \log \frac{P(d, q | c)}{P(d | c) \cdot P(q | c)}$$

Document and query are considered as sequences of term events. The decomposition of d leads to TF-IDF (and, for particular assumptions, to BM25), and the decomposition of q leads to LM. In this section we review the decomposition of d. When decomposing d using $P(d, q) = P(d|q) P(q)$ and then $P(d) = \prod_{t \in \mathcal{T}_d} P(t)^{tf_d}$ and $P(d|q) = \prod_{t \in \mathcal{T}_d} P(t|q)^{tf_d}$, we obtain:

$$\log \frac{P(d|q)}{P(d)} = \sum_{t \in \mathcal{T}_d} tf_d \cdot \log \frac{P(t|q)}{P(t)} \qquad (25)$$

Here, P(t|q) is the query term probability, and P(t) is the background model (collection-wide) term probability. The equation makes two independence assumptions: different terms are independent, and also, the multiple occurrences of the same term are independent. The first assumption is reflected in applying the sum over different terms, and the second assumption is reflected by the total term frequency count, $tf_d$.
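The intermediate steps of this decomposition can be written out as follows. This is a sketch under the two independence assumptions named above, with the collection index c dropped and $\mathcal{T}_d$ denoting the set of terms in d (the notation of the set symbol is an assumption here).

```latex
\begin{align*}
\log \frac{P(d, q)}{P(d)\,P(q)}
  &= \log \frac{P(d \mid q)\,P(q)}{P(d)\,P(q)}
   = \log \frac{P(d \mid q)}{P(d)} \\
  &= \log \prod_{t \in \mathcal{T}_d}
       \left( \frac{P(t \mid q)}{P(t)} \right)^{tf_d}
   = \sum_{t \in \mathcal{T}_d} tf_d \cdot \log \frac{P(t \mid q)}{P(t)}
\end{align*}
```

The product over terms is where term independence enters, and the exponent $tf_d$ is where the independence of repeated occurrences enters; replacing $tf_d$ by a saturating TF quantification relaxes the second assumption.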
To provide a justification for TF-IDF, one is looking for the bridges to close the gap between the probabilistic roots (assuming independence) and the TF-IDF. Expressed as an equation, we are looking for justifications to transform components of Eq. (25) to TF-IDF.
where TF and IDF are the two components, term frequency and inverse document frequency.

Observations about the TF component
The within-document term frequency ($tf_d$) in IR models is usually not used in its raw form, due to its bias towards long documents, as motivated in Sect. 2. The step from $tf_d$ towards a quantification function involves a normalization component, referred to as $K_d$. The widely known $TF_{BM25}$ normalization factor is:

$$K_d := k_1 \cdot \left(1 - b + b \cdot \frac{l_d}{\bar{l}_d}\right) \qquad (26)$$

Given that $k_1$ and b are parameters of $K_d$, one should use the notation $K_{k_1,b,d}$, but for readability, we simplify the notation to $K_d$.
The following definition formalizes the well-defined spectrum of TF quantifications (Roelleke et al. 2015).
The shape of the different TF quantifications is shown in Fig. 2. This spectrum is well-defined because each of these TFs corresponds to an assumption regarding term events (Roelleke et al. 2015). $TF_{total}$ corresponds to assuming independence, and the $TF_{log}$ and $TF_{BM25}$ variants assume the occurrences of an event to be dependent.
With this understanding of what the TF stands for, namely a factor modeling a dependence assumption, the role of K d is to tune the dependence assumption. For K d > 1 , that is for long documents, TF(t, d) decreases, i.e. the dependence increases. This means that in long documents, the multiple term occurrences are more dependent than in short documents. This makes perfect sense when imagining a long document that repeats some terms many times.
This discussion makes evident that it is not just the length of the document that matters. To illustrate, consider two documents of equal length, for example, l d = 300 words. The standard K d will be equal for both documents. One document, however, contains many repetitions of some words (the document is verbose), whereas the other document contains many different words (the document is not verbose). Indeed, it is the verboseness and not simply the document length that leads to high term frequencies, and thus, to dependencies of multiple term occurrences. Therefore, this paper views K d as a combination of the pivoted document length ( pivdl ) and the pivoted document verboseness ( pivdv).
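The two-document illustration can be made concrete. The following is a minimal sketch using the standard BM25 normalization factor, $k_1 (1 - b + b \cdot l_d / \text{avgdl})$; the average document length, the numbers of distinct words, and the function names are hypothetical values chosen for illustration.

```python
def k_d_standard(l_d, avg_doc_len, k1=1.2, b=0.7):
    """Standard BM25 normalization factor: depends on the document length only."""
    return k1 * (1.0 - b + b * l_d / avg_doc_len)


def verboseness(l_d, n_distinct):
    """Document length divided by the number of distinct terms."""
    return l_d / n_distinct


avg_doc_len = 400  # hypothetical collection average

# Two documents of equal length (300 words) but different vocabulary sizes.
repetitive = {"l_d": 300, "n_distinct": 50}
varied = {"l_d": 300, "n_distinct": 250}

k_repetitive = k_d_standard(repetitive["l_d"], avg_doc_len)
k_varied = k_d_standard(varied["l_d"], avg_doc_len)
v_repetitive = verboseness(**repetitive)
v_varied = verboseness(**varied)

print(k_repetitive == k_varied)  # True: the standard factor cannot tell them apart
print(v_repetitive, v_varied)    # the verboseness values differ by a factor of five
```

Because the standard factor is identical for both documents, any difference in how their term frequencies should be discounted has to come from a second signal such as verboseness.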
The following equation indicates the difference between the standard $K_d$ as known for BM25 [as shown in Eq. (26)], and the systematic extension proposed and investigated in this paper:

$$K_d = k_1 \cdot (1 - b + b \cdot \text{pivdl}) \quad \longrightarrow \quad K_d = k_1 \cdot \big(1 - b + b \cdot f(\text{pivdl}, \text{pivdv})\big)$$

Here, f(pivdl, pivdv) is a function combining the two parameters, and this paper explores both a conjunctive and a disjunctive combination.

Observations about the IDF component
Regarding $TF_{BM25}$-IDF, the question remains of how to close the gap between $P(t|q)/P(t)$ and IDF, as commonly defined in the literature: $IDF(t) = 1 / P_D(t)$. Mathematically, we are looking for a justification that leads to the following equation:

$$\frac{P(t|q)}{P(t|c)} = \frac{1}{P_D(t|c)} \qquad (29)$$

where, in order to avoid confusion in the next derivation steps, the collection symbol c is made explicit. We note that $P(t|c)$ and $P_D(t|c)$ are both in the denominators of the functions. Let us consider what the relation between these two elements is, i.e. $P(t|c) / P_D(t|c)$. Referring back to the notations introduced at the end of Sect. 1.1, we have:

$$\frac{P(t|c)}{P_D(t|c)} = \frac{l_t / l_c}{|\mathcal{D}_t| / |\mathcal{D}|} = \frac{b_t}{\bar{l}_d} \qquad (31)$$

and, substituting in the left side of (29), it becomes:

$$\frac{P(t|q)}{P(t|c)} = \frac{P(t|q)}{\frac{b_t}{\bar{l}_d} \cdot P_D(t|c)} \qquad (32)$$

This equation makes burstiness explicit, and in particular its otherwise implicit role in the relationship between IDF and the probabilistic model. If we were to return to Eq. (29), we are forced to consider:

$$P(t|q) = \frac{b_t}{\bar{l}_d}$$

Essentially, we have observed that the IDF, in its generic form of $1 / P_D(t|c)$, implies that, when the term is not part of the query q, we estimate P(t|q) as the probability of the term in the collection (P(t|c)), and when the term is part of q we estimate it as $P(t|q) = b_t / \bar{l}_d$. This separation between the cases $t \in \mathcal{T}_q$ and $t \notin \mathcal{T}_q$ is reminiscent of smoothing in language modeling. We could, for instance, write this as a mixture whose weight depends only on whether t occurs in q. We shall call this an extreme mixture.
If we were to continue this inspiration from language modeling, leaving the above aside for a moment, to compute P(t|q, c) we would estimate it through a linear mixture between P(t|c) and P(t|q). Such a mixture is traditionally used because estimating the probability of a term given the query q is unreliable when q is short (even more so than for a document d).
Substituting Eq. (36) into Eq. (32), P(t|q) would be calculated in the traditional way, with a maximum likelihood estimator. However, this would not solve our problem given by the shortness of q. Instead, we need to use the estimation of Eq. (34). Then, reintroducing the distinction between $t \in \mathcal{T}_q$ and $t \notin \mathcal{T}_q$ through the mixture parameter of q, we obtain a formulation in which, if this parameter is set to 1, the foreground probability P(t|c) cancels out from the linear mixture assumption, ending up with the standard IDF. We shall call the inverse document frequency resulting from this formulation $IDF_L$, where L stands for linear mixture, in contrast to the standard IDF (or $IDF_E$) that is defined by an extreme mixture.
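The relation between the two term probabilities can be verified numerically. The sketch below assumes the location-based estimate P(t|c) = (occurrences of t) / (occurrences in c) and the document-based estimate P_D(t|c) = (documents containing t) / (documents in c); the collection statistics and variable names are hypothetical.

```python
# Hypothetical collection statistics for a single term t.
l_c = 10_000   # total term occurrences in the collection
n_docs = 50    # number of documents in the collection
l_t = 120      # occurrences of t in the collection (term length)
df_t = 8       # number of documents containing t (document frequency)

p_location = l_t / l_c      # location-based P(t|c)
p_document = df_t / n_docs  # document-based P_D(t|c)

burstiness = l_t / df_t
avg_doc_len = l_c / n_docs

# The ratio of the two probabilities equals burstiness over average document length.
ratio = p_location / p_document
print(ratio, burstiness / avg_doc_len)

# Estimating P(t|q) by that same quantity turns P(t|q) / P(t|c) into 1 / P_D(t|c).
p_t_given_q = burstiness / avg_doc_len
print(p_t_given_q / p_location, 1 / p_document)
```

Both printed pairs agree up to floating-point rounding, which is the numerical counterpart of burstiness being the implicit link between the location-based probability and the IDF.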

LM and TF-IDF
Our analysis has already reached a point where the border between LM and TF-IDF gets blurred. In this section we discuss the derivation of the LM model and highlight some commonality with the derivation of TF-IDF done in the previous section. Recall that the discussion of IDF in $TF_{BM25}$-IDF started from Eq. (24), where we decomposed $P(d, q) = P(d|q) P(q)$. Here we can review the decomposition of q as $P(d, q) = P(q|d) P(d)$.
We will then have $P(q|d) = \prod_{t \in \mathcal{T}_q} P(t|d)^{tf_q}$, and:

$$\log \frac{P(q|d)}{P(q)} = \sum_{t \in \mathcal{T}_q} tf_q \cdot \log \frac{P(t|d)}{P(t|c)}$$

Using again the observation formalized in Eq. (31), we observe the explicit presence of burstiness in the following equation, as it was in Eq. (32):

$$\frac{P(t|d)}{P(t|c)} = \frac{P(t|d)}{\frac{b_t}{\bar{l}_d} \cdot P_D(t|c)}$$

Analogously to the derivation of TF-IDF, for the estimation of P(t|q, c) in Eq. (36), and as commonly done in language modeling, we estimate P(t|d, c) as a linear mixture of P(t|d) and P(t|c); substituting into Eq. (40), we can notice the symmetry with Eq. (37). In LM, when applying a Dirichlet-based mixture (D-LM), the mixture parameter of d is defined as in Zhai and Lafferty (2001), in terms of the document length and a parameter $\mu$ of the collection. This parameter could be set based on the average document length $\bar{l}_d$. Zhai and Lafferty (2001) report values of $\mu \approx 2000$, though they note that the range of optimal parameter values in different collections is quite large (500-10,000). Later, Fang et al. (2004) posited that $\mu$ needs to be at least as large as the average document length ($\bar{l}_d$), which suggests a form of the mixture parameter of d based on $\bar{l}_d$. Now, just as we did for the normalization of TF in the TF-IDF derivation, we should consider here not only the presence of the document length but also that of verboseness. In a symmetric way we may define for TF-IDF a parameter that does not depend strongly on the presence or absence of the term in q (as was the case in the extreme mixture observed in the previous section), but rather uses the Dirichlet-based smoothing approach and the maximum likelihood estimation $P(t|q) = tf_q / l_q$. However, the components of this formulation for q are generally not very informative (queries tend to be significantly shorter than documents, and therefore we cannot really talk about the verboseness of a query). Instead, at this place we can exploit the duality of document verboseness and length with term length and burstiness (see Sect. 3.3). In summary, in this section we have explored the relationship between TF-IDF and LM. Both models apply a mixture: TF-IDF for estimating P(t|q, c), and LM for estimating P(t|d, c).
Moreover, both models involve the component $\frac{b_t}{\bar{l}_d} \cdot P_D(t)$ measuring the discriminativeness of the term, where burstiness is made explicit.
The mixture assumption for P(t|q, c) leads to IDF and it becomes clear why IDF is seen as capturing burstiness in an "implicit" way (Church and Gale 1999). The Dirichlet-based mixture for P(t|d, c), usually only associated with the document length, is extended with the document verboseness. This extension is done analogously to the way the TF quantification has been extended for the TF-IDF models.

Experiments
In this section, we first present the material, then the experimental setup. Finally we discuss the results.

Setup and materials
To test the TF normalization variants on the different kinds of TF quantifications, we used 4 test collections: TREC HARD 2005, TREC Ad Hoc 8, CLEF eHealth 2014, and TREC Web 2002. Details and corpora properties are shown in Table 2. The test collections have been purposefully chosen with a high degree of variability of $v_d$. In this way we can observe the different use of the language in different domains (e.g. we observe that in .GOV on average a term is repeated 218% more times than in the Aquaint collection). We developed the tested IR models on the IR platform Terrier 4.2. All the documents have been preprocessed using the English tokenizer and Porter stemmer of the Terrier search engine. When topics of multiple lengths are available in a test collection, we use the shortest kind.
We tested a total of 24 models:
- 16 models based on TF-IDF variants: 4 TF normalizations for each of the 4 TF quantifications defined in Definition 2. Each model is identified by its TF quantification ($TF_{total}$, $TF_{log}$, $TF_{BM25}$, or $TF_{constant}$) and by the kind of TF normalization applied: non-elite disjunctive $\ddot{K}_{\vee,d}$, non-elite conjunctive $\ddot{K}_{\wedge,d}$, elite disjunctive $\hat{K}_{\vee,d}$, and elite conjunctive $\hat{K}_{\wedge,d}$.
- 4 models based on D-LM: each Dirichlet-based mixture is identified by the kind of normalization applied to its mixture parameter for d: non-elite disjunctive, non-elite conjunctive, elite disjunctive, and elite conjunctive (subscripts $\vee,d$ and $\wedge,d$; a diaeresis marks the non-elite and a hat the elite pivotization).
- 4 models based on TF-IDF$_L$: each Dirichlet-based mixture is identified by the kind of normalization applied to its mixture parameter for q: non-elite disjunctive, non-elite conjunctive, elite disjunctive, and elite conjunctive (subscripts $\vee,q$ and $\wedge,q$, with the same accents). As TF component, we select the non-normalized $TF_{total}$.
The TF normalization of each model has 3 parameters: $k_1$, $b$, and the new $a$ introduced in this paper. The D-LM and TF-IDF$_L$ based models have 2 parameters: $b$ and $a$. Our experiments focus on the parameter $a$. For $k_1$ and $b$, there are two ways of selecting values: using the standard values from the literature, or identifying trained values. For the models based on the TF-IDF variants, the standard parameters for $TF_{BM25}$ are $k_1 = 1.2$ and $b = 0.7$ (Robertson et al. 1994). The standard parameter for $TF_{total}$ and $TF_{constant}$ is $b = 0$, which simplifies $K_d$ to a constant. In this case we set $k_1 = 1$, because it is easy to show that changing $k_1$, as long as $k_1 > 0$, does not change the ranking of the retrieved documents for these two quantifications. The same parameter values are used for the standard $TF_{log}$ ($b = 0$, $k_1 = 1$). For the models based on the D-LM, the standard parameters are $k_1 = 1$ and $b = 0$, which reduce the model to the standard definition of D-LM (Zhai and Lafferty 2001). For the models based on the LM variant derived from TF-IDF, the standard parameter is $k_1 = +\infty$, which reduces the model to the standard TF-IDF model with non-normalized $TF_{total}$ quantification.
To identify trained values, the parameters of each model have been spanned as follows: $a, b \in [0, 1]$ at steps of 0.1, and $k_1 \in [0, 5]$, from 0 to 1 at steps decided by the function $1/n$ with $n \in \{1, ..., 50\}$, and from 1 to 5 at steps of 0.1. The trained values are obtained by maximizing the mean over the topics of the selected evaluation measure. For every model configuration that requires training we perform a fivefold cross validation.
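The parameter sweep and the fivefold cross validation described above can be sketched as follows. This is an illustrative reading, not the authors' code: whether $k_1 = 0$ itself is part of the grid, how topics are assigned to folds (round-robin here), and the `score_fn(topic, a, b, k1)` callback that returns the evaluation measure for one topic are all assumptions.

```python
from itertools import product


def ab_grid():
    """Values 0.0, 0.1, ..., 1.0 used for both a and b."""
    return [round(0.1 * i, 1) for i in range(11)]


def k1_grid(include_zero=True):
    """1/n for n = 1..50 between 0 and 1, then 1.1..5.0 in steps of 0.1."""
    fine = sorted(1.0 / n for n in range(1, 51))
    coarse = [round(1.0 + 0.1 * i, 1) for i in range(1, 41)]
    return ([0.0] if include_zero else []) + fine + coarse


def five_folds(topic_ids, k=5):
    """Yield (train, test) topic lists; round-robin assignment is an assumption."""
    ids = list(topic_ids)
    folds = [ids[i::k] for i in range(k)]
    for i in range(k):
        train = [t for j, fold in enumerate(folds) if j != i for t in fold]
        yield train, folds[i]


def mean_score(topics, score_fn, params):
    return sum(score_fn(t, *params) for t in topics) / len(topics)


def train_params(train_topics, score_fn):
    """Return the (a, b, k1) triple maximizing the mean measure over the topics."""
    grid = product(ab_grid(), ab_grid(), k1_grid())
    return max(grid, key=lambda p: mean_score(train_topics, score_fn, p))


def cross_validate(topic_ids, score_fn, k=5):
    """Mean over folds of the test score obtained with parameters trained on the rest."""
    fold_scores = []
    for train, test in five_folds(topic_ids, k):
        best = train_params(train, score_fn)
        fold_scores.append(mean_score(test, score_fn, best))
    return sum(fold_scores) / k
```

In practice `score_fn` would run the retrieval model on one topic and return, for example, its AP; an exhaustive sweep like this is feasible because the grid is small.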
The IR evaluation measures employed are AP, NDCG and P@10.
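For readers less familiar with these measures, the sketch below gives common textbook definitions for a single topic. It is an illustration under stated assumptions, not the evaluation tool used in the experiments: the NDCG variant shown uses linear gains with a $\log_2$ rank discount, and the function and argument names are hypothetical.

```python
import math


def precision_at_k(ranking, relevant, k=10):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for d in ranking[:k] if d in relevant) / k


def average_precision(ranking, relevant):
    """Mean of the precision values at the rank of each relevant document."""
    hits, total = 0, 0.0
    for rank, d in enumerate(ranking, start=1):
        if d in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def ndcg(ranking, gains, k=None):
    """NDCG with linear gains; gains maps a document id to its graded relevance."""
    cut = ranking[:k] if k else ranking
    dcg = sum(gains.get(d, 0) / math.log2(i + 1)
              for i, d in enumerate(cut, start=1))
    ideal = sorted(gains.values(), reverse=True)
    if k:
        ideal = ideal[:k]
    idcg = sum(g / math.log2(i + 1) for i, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0
```

The scores reported per test collection are the means of these per-topic values over all topics.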

Model candidates/structure
Each TF-IDF model candidate is characterized by choosing one of the following options:
1. Pivotization: elite pivotization or non-elite pivotization for document verboseness and length;
2. Normalization: conjunctive ($\wedge$) or disjunctive ($\vee$) combination of pivoted document verboseness and length into $K_d$;
3. Quantification: $TF_{total}$, $TF_{log}$, $TF_{BM25}$, or $TF_{constant}$;
4. Parameter Settings: standard (S) or trained (T) parameters.

Each D-LM model candidate is characterized by choosing one of the following options:
1. Pivotization: elite pivotization or non-elite pivotization for document verboseness and length;
2. Normalization: conjunctive ($\wedge$) or disjunctive ($\vee$) combination of pivoted document verboseness and length into the mixture parameter for d;
3. Parameter Settings: standard (S) or trained (T) parameters.

Each TF-IDF$_L$ model candidate is characterized by choosing one of the following options:
1. Pivotization: elite pivotization or non-elite pivotization for term length and burstiness;
2. Normalization: conjunctive ($\wedge$) or disjunctive ($\vee$) combination of pivoted term length and burstiness into the mixture parameter for q;
3. Parameter Settings: standard (S) or trained (T) parameters.
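The structural options listed above, leaving aside the standard versus trained parameter setting, account for the 24 models tested. A small enumeration makes the count explicit; the tuple layout and string labels are hypothetical and serve only as an illustration.

```python
from itertools import product

PIVOTIZATIONS = ("non-elite", "elite")
COMBINATIONS = ("disjunctive", "conjunctive")
QUANTIFICATIONS = ("TF_total", "TF_log", "TF_BM25", "TF_constant")


def model_candidates():
    """Enumerate (family, pivotization, combination, quantification) tuples."""
    tfidf = [("TF-IDF", p, c, q)
             for q, p, c in product(QUANTIFICATIONS, PIVOTIZATIONS, COMBINATIONS)]
    # D-LM has no TF quantification choice
    dlm = [("D-LM", p, c, None)
           for p, c in product(PIVOTIZATIONS, COMBINATIONS)]
    # TF-IDF_L always uses the non-normalized TF_total
    tfidf_l = [("TF-IDF_L", p, c, "TF_total")
               for p, c in product(PIVOTIZATIONS, COMBINATIONS)]
    return tfidf + dlm + tfidf_l
```

Each of these structures is then evaluated with standard (S) and, where applicable, trained (T) parameters.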

Results
The main results observed are:
1. Document Verboseness versus Length: the two show a certain independence, as shown by the shape of the distributions in Fig. 3;
2. Pivotization: for TF-IDF models the elite pivotization is overall better than the non-elite one; for D-LM models the non-elite pivotization performs better;
3. Normalization: for TF-IDF models the combination of document verboseness and length achieves significantly better results, especially when combined in a conjunctive fashion; for D-LM models the combination of document verboseness and length rarely achieves statistical significance;
4. TF-Quantification: $TF_{BM25}$ appears best, with $TF_{log}$ close behind;
5. Standard versus Trained parameters: in both parameter configurations, standard and trained, the use of verboseness makes the model achieve better results. On the other hand, the use of term length most of the time has a negligible impact.

For each test collection (HARD 2005 in Table 3, Ad Hoc 8 in Table 4, eHealth 2014 in Table 5, and Web 2002 in Table 6), we present the results obtained with the TF-IDF model variants and the two pivotizations. In these tables we observe each model with either its standard configuration (S) or its trained configuration (T), obtained by taking the configuration that maximizes the evaluation measure AP. The standard parameters of the normalizations for the TF quantifications $TF_{total}$, $TF_{log}$ and $TF_{constant}$ have the effect of disabling the normalization. However, for $TF_{BM25}$ this does not happen. Thereby, we can study the effect of the parameter $a$ in its standard parametrization. To do this we extract the best value obtained with the standard $k_1$ and $b$ by selecting the maximum value of the measure AP obtained by varying the parameter $a$. In the case of the trained parameter values instead, for all the TF quantifications, we show in the first row the best result obtained by maximizing the AP without the use of verboseness in the scoring function ($a = 1$), and then we show the result obtained when verboseness is added to the scoring function. The tables distinguish between the conjunctive ($\wedge$) and disjunctive ($\vee$) combinations of document verboseness and length. $TF_{BM25}$ generally works better than the other TF quantifications, but not for all test collections. For the test collection eHealth 2014, $TF_{log}$ is better.

Table 6 Comparison of the scores obtained with the TF-IDF model candidates with each TF normalization using the non-elite and elite pivotization for the Web 2002 test collection. Column K indicates if standard (S) or trained (T) parameters are used. † indicates statistical significance (paired t-test, $p < 0.05$) against the standard and ‡ against the trained parameters when $a$ is not used.
We also observe that the best configuration is achieved using the elite pivotization. The conjunctive combination generally works better than the disjunctive one (in 24 of 32 experiments it is better than the disjunctive; all 7 unfavorable cases occur when using the Web 2002 test collection).
In Table 7, we present the results obtained for every test collection using D-LM with the mixture parameter for d extended with verboseness. For this model the standard parametrization is $b = 1$ and $a = 0$, which reduces the formula to the standard D-LM without verboseness (Zhai and Lafferty 2001). This variant is shown in the first row for every test collection. The subsequent rows present the variants of the mixture parameter when combined with verboseness in disjunction and conjunction with non-elite and elite pivots. For this model we observe that the presence of verboseness produces significant improvements for only one test collection. Overall we observe that the non-elite pivotization should be preferred (all the experiments produce better results than with the elite one). No difference is observed between the disjunctive and conjunctive combinations of the pivots.
In Table 8, we present the results obtained for every test collection using the TF-IDF$_L$ model with the mixture parameter for q, which combines in an LM fashion the term length and burstiness. For this model the standard parametrization sets the mixture parameter for q to 1, which reduces this IR model to a non TF-normalized $TF_{total}$-IDF model. This variant is shown in the first row for every test collection. The following rows present the variants of the mixture parameter when combined in disjunction and conjunction with non-elite and elite pivots. We observe that this parametrization produces significantly better results than the standard case, and that the non-elite parametrization should be preferred. Also here, as for D-LM, no difference is observed between the disjunctive and conjunctive combinations of the pivots. We also observe that overall the value of the trained parameter $a$ is often equal to 1, which suggests that, for these model variants, the term length does not play an important role in adjusting the document's score. This is a curious behavior since it is dual to the D-LM model, where the document verboseness does not play an important role either. Finally, in Tables 9 and 10 we present the results of the fivefold cross validation for all the trained cases of the TF-IDF models, in the first table, and of the D-LM and TF-IDF$_L$ models, in the second table.

Analysis and discussion
Finally we make some observations across the experimental results about the behavior of the parameter a. Before that however, let us make an observation on the nature of the data at our disposal. Figure 3 shows the distribution of the document verboseness versus document length for the elite and non-elite pivotizations. In both cases we see that verboseness brings additional information compared to document length: the plotted distributions are well spread, away from the first diagonal.
Comparing the two distributions, it is interesting to observe that the non-elite pivotization is significantly more skewed than the elite one: the x-axis of the left plot has a scale in the (0, 0.02) range, while the one on the right plot has a scale that matches the y-scale: (0, 4). This supports and grounds our hypothesis that elite pivotization should provide us better means to balance verboseness and document length with parameter $a$. The $a$ parameter controls the contribution of elite pivoted verboseness and elite pivoted document length. When $a < 0.5$, the contribution of the document verboseness is higher than the contribution of the document length, and vice versa when $a > 0.5$. Looking at the distribution for the elite pivotizations of the documents, and redefining the origin to the point (1, 1), we split the distributions into four quadrants. We know that whatever $a$ we fix, the documents in the I quadrant will always be demoted to some degree, and the documents in the III quadrant will always be promoted to some degree. So the question is what happens to the documents in the II and IV quadrants. When the contribution of document verboseness is preferred ($a > 0.5$), more documents with low verboseness ($v_d < 1$) and high length ($\hat{l}_d > 1$) will be promoted against the documents of the IV quadrant, and when the contribution of the document length is preferred ($a < 0.5$) the contrary happens. Therefore, the $a$ values should anti-correlate with the ratio of the number of relevant documents between the II quadrant and the IV quadrant. The two lists of values, sorted by test collection, are the values of $a$ extracted from Tables 3, 4, 5, and 6 for the standard BM25 case with trained $a$ (0.8, 0.6, 0.4, and 0.0) and the ratios (0.63, 0.86, 1.16, and 4.20); we observe that they anti-correlate.

Table 9 Fivefold cross validation of the trained TF-IDF models candidates observed in Tables 3, 4
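The anti-correlation between the two lists of four values reported above can be checked with a few lines of code. The sketch below computes the Pearson and Spearman coefficients in plain Python; the choice of these two coefficients is ours, since the text does not name a specific correlation measure, and with only four points the result is indicative rather than statistically conclusive.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """Rank positions (1 = smallest); assumes all values are distinct."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result


def spearman(x, y):
    """Spearman rank correlation (no tie correction)."""
    return pearson(ranks(x), ranks(y))


# Values as reported in the text, sorted by test collection
a_values = [0.8, 0.6, 0.4, 0.0]
ii_iv_ratios = [0.63, 0.86, 1.16, 4.20]

print("Pearson: ", pearson(a_values, ii_iv_ratios))
print("Spearman:", spearman(a_values, ii_iv_ratios))
```

Both coefficients are negative for these values; the rank correlation is the more informative of the two here, because the last ratio is much larger than the other three.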
Therefore, if we assume that all the documents of the collection should be relevant, we should find the $a$ value that best balances the proportion of non-verbose but long documents with that of short but verbose documents. All the test collections but Disks 4&5 have been crawled from the Web. For all of them we can observe that the plots manifest visible noise. In particular we observe the presence of black dots that are most probably caused by the existence of duplicated documents in the collections. For example, the existence of duplicated documents in the e-Health'14 test collection is a known issue to the e-Health IR community.
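The quadrant split around the point (1, 1) used in this analysis can be sketched as follows. The axis assignment (pivoted verboseness on the x-axis, pivoted length on the y-axis) is inferred from the description of the II quadrant as containing non-verbose but long documents, and the treatment of points lying exactly on a boundary is an assumption; the function names are hypothetical.

```python
from collections import Counter


def quadrant(verboseness, length):
    """Quadrant of a document around the origin (1, 1) of the elite-pivoted plane."""
    if verboseness >= 1.0 and length >= 1.0:
        return "I"    # verbose and long
    if verboseness < 1.0 and length >= 1.0:
        return "II"   # non-verbose but long
    if verboseness < 1.0 and length < 1.0:
        return "III"  # non-verbose and short
    return "IV"       # verbose but short


def ratio_ii_iv(relevant_docs):
    """Ratio of relevant documents in the II quadrant to those in the IV quadrant.

    relevant_docs: iterable of (pivoted verboseness, pivoted length) pairs.
    """
    counts = Counter(quadrant(v, l) for v, l in relevant_docs)
    if counts["IV"] == 0:
        return float("inf")
    return counts["II"] / counts["IV"]
```

Computed over the relevant documents of a test collection, this ratio is the quantity compared against the trained values of $a$.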
In Tables 3, 4, 5, and 6 we observe that the best performing configuration, for both $TF_{log}$ and $TF_{total}$, uses the trained parameters combined in disjunction; in particular, in Table 4 these configurations also show statistical significance against both the standard configuration and the trained configuration when verboseness is not present ($a = 0$). The elite pivotization generally performs better than the non-elite pivotization. In particular, the best performing configurations use elite pivotization and trained parameters in conjunction. We also observe that, in general, with the elite pivotization the weighting role is taken by the parameter $a$ ($b = 1$ means that a full document verboseness and length normalization is applied).
In Fig. 4 we further analyze the best configuration on a per-topic basis. Here, we show the difference between the AP of the trained $TF_{BM25}$-IDF with verboseness combined in conjunction with elite pivots and the AP of the trained classic $TF_{BM25}$-IDF. If the difference is positive, the variant with verboseness is better than the classic version.

Conclusion
This paper presents an extensive study of TF quantifications and normalizations. The quantifications are with respect to a well-defined spectrum comprising $TF_{total}$, $TF_{log}$, $TF_{BM25}$, and $TF_{constant}$. Each of these TF quantifications reflects a dependence assumption. In particular, $TF_{total}$ and $TF_{constant}$ are the extremes of the quantification spectrum, assuming independence for the former and subsumption for the latter. $TF_{BM25}$ is a relatively strong dependence assumption, and $TF_{log}$ is in the middle between $TF_{total}$ and $TF_{BM25}$. Each of these quantifications incorporates a TF normalization parameter, usually denoted as $K_d$.
Whereas current approaches regarding $K_d$ consider only the document length as parameter of $K_d$, this paper makes the case for $K_d$ to be a combination of document verboseness and length. There are many heuristic options for how to combine the parameters, and this paper contributes the theoretical foundations leading to a systematic combination of document verboseness and length.
The paper reports results of an experimental study investigating the effect of various settings of $K_d$ for the four main TF quantifications. The overall finding is that combining document verboseness with document length (either in a conjunctive or disjunctive way) improves retrieval quality when compared to results considering document length only.
We expand this in two directions: first by exploring a similar normalization in the context of LM, and second by exploring a similar normalization in the context of TF-IDF. For the former, we include document verboseness in the Dirichlet smoothing, where no significant effect is observed, which signifies that document verboseness can be neglected. For the latter, in Sect. 4.3 we have observed the duality between document verboseness and document length on one side, and term burstiness and term length on the other side, and we observed the effect of these normalizations on the query side with respect to LM. Here, significant improvements are observed; however, these improvements are obtained primarily by the use of term burstiness, while the term length can be neglected. In both directions the new parametrizations yield improvements, and their results show a dual behavior, given by the exclusion of document verboseness in the former, and by the exclusion of term length in the latter.
In summary, in this paper we have provided an exhaustive study of normalization factors in IR probabilistic models using 4 different test collections. Based on the observations made on these test collections, we have made the case that different domains, having different text statistics, can be directly factored into the existing probabilistic models. We have thus provided a quantification of the various document and term statistics into one factor that balances the different prior probabilities that all these models, more or less explicitly, rely on.