A probabilistic justification for using tf×idf term weighting in information retrieval

Hiemstra, Djoerd

doi:10.1007/s007999900025

Natural language processing for digital libraries
Published: August 2000

Volume 3, pages 131–139, (2000)

International Journal on Digital Libraries

Djoerd Hiemstra¹

Abstract.

This paper presents a new probabilistic model of information retrieval. The most important modeling assumption made is that documents and queries are defined by an ordered sequence of single terms. This assumption is not made in well-known existing models of information retrieval, but is essential in the field of statistical natural language processing. Advances already made in statistical natural language processing will be used in this paper to formulate a probabilistic justification for using tf×idf term weighting. The paper shows that the new probabilistic interpretation of tf×idf term weighting might lead to a better understanding of statistical ranking mechanisms, for example by explaining how they relate to coordination level ranking. A pilot experiment on the TREC collection shows that the linguistically motivated weighting algorithm outperforms the popular BM25 weighting algorithm.
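
For readers unfamiliar with the weighting scheme named in the abstract, the following is a minimal sketch of textbook tf×idf ranking. It is not the probabilistic model derived in the paper, whose formulas do not appear on this page. The function name `tfidf_scores`, the whitespace tokenizer, the unsmoothed `log(N / df)` form of idf, and the three toy documents are all assumptions made purely for illustration.

```python
import math
from collections import Counter


def tfidf_scores(query, docs):
    """Score each document by summing tf * idf over the query terms.

    Illustrative textbook variant only: raw term counts for tf and
    log(N / df) for idf, with naive whitespace tokenization.
    """
    n_docs = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)  # term frequency within this document
        score = 0.0
        for term in query.lower().split():
            if df[term] == 0:
                continue  # term occurs nowhere in the collection
            score += tf[term] * math.log(n_docs / df[term])
        scores.append(score)
    return scores


# Hypothetical toy collection, not data from the paper.
docs = [
    "information retrieval with term weighting",
    "statistical natural language processing",
    "term weighting for statistical information retrieval",
]
scores = tfidf_scores("statistical retrieval", docs)
```

In this toy collection the third document contains both query terms and so receives the highest score, while the first two each match one term and tie.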

Author information

Authors and Affiliations

Centre for Telematics and Information Technology, University of Twente, The Netherlands; E-mail: hiemstra@ctit.utwente.nl
Djoerd Hiemstra

Additional information

Received: 17 December 1998 / Revised: 31 May 1999

About this article

Cite this article

Hiemstra, D. A probabilistic justification for using tf×idf term weighting in information retrieval. Int J Digit Libr 3, 131–139 (2000). https://doi.org/10.1007/s007999900025

Issue Date: August 2000
DOI: https://doi.org/10.1007/s007999900025

Key words: Information retrieval theory – Statistical information retrieval – Statistical natural language processing
