Encyclopedia of Machine Learning

2010 Edition
Editors: Claude Sammut, Geoffrey I. Webb


Reference work entry
DOI: https://doi.org/10.1007/978-0-387-30164-8_832

TF–IDF (term frequency–inverse document frequency) is a term weighting scheme commonly used to represent textual documents as vectors (for purposes of classification, clustering, visualization, retrieval, etc.). Let \(T = \{t_1,\ldots,t_n\}\) be the set of all terms occurring in the document corpus under consideration. Then a document \(d_i\) is represented by an n-dimensional real-valued vector \(x_i = (x_{i1},\ldots,x_{in})\) with one component for each possible term from T.

The weight \(x_{ij}\) corresponding to term \(t_j\) in document \(d_i\) is usually a product of three parts: one which depends on the presence or frequency of \(t_j\) in \(d_i\), one which depends on \(t_j\)'s presence in the corpus as a whole, and a normalization part which depends on \(d_i\). The most common TF–IDF weighting is defined by \(x_{ij} = \mathrm{TF}_{ij} \cdot \mathrm{IDF}_{j} \cdot \bigl({\sum}_{k}(\mathrm{TF}_{ik}\,\mathrm{IDF}_{k})^{2}\bigr)^{-1/2}\), where \(\mathrm{TF}_{ij}\) is the frequency of term \(t_j\) in document \(d_i\), \(\mathrm{IDF}_{j} = \log(N/\mathrm{DF}_{j})\) with N the number of documents in the corpus and \(\mathrm{DF}_{j}\) the number of documents containing \(t_j\), and the final factor normalizes the vector \(x_i\) to unit Euclidean length.
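As a concrete illustration of this weighting, the following Python sketch builds TF–IDF vectors (represented sparsely as term-to-weight dictionaries) for a toy corpus. The raw-count TF, the log-ratio IDF, the whitespace tokenizer, and the example corpus are assumptions made for the sketch, not prescribed by this entry.

    import math
    from collections import Counter

    def tfidf_vectors(corpus):
        # TF_ij: raw count of term t_j in document d_i (assumed TF variant)
        docs = [Counter(text.lower().split()) for text in corpus]
        n_docs = len(docs)
        # DF_j: number of documents containing term t_j at least once
        df = Counter(term for doc in docs for term in doc)
        # IDF_j = log(N / DF_j)  (assumed IDF definition)
        idf = {term: math.log(n_docs / count) for term, count in df.items()}
        vectors = []
        for doc in docs:
            # Unnormalized weights TF_ij * IDF_j
            w = {term: tf * idf[term] for term, tf in doc.items()}
            # Cosine normalization: divide by (sum_k (TF_ik * IDF_k)^2)^(1/2)
            norm = math.sqrt(sum(v * v for v in w.values()))
            vectors.append({term: (v / norm if norm > 0 else 0.0)
                            for term, v in w.items()})
        return vectors

    # Toy usage (illustrative corpus only)
    corpus = ["the cat sat on the mat",
              "the dog sat on the log",
              "cats chase dogs"]
    for vec in tfidf_vectors(corpus):
        print({term: round(weight, 3) for term, weight in vec.items()})

Each resulting vector has unit Euclidean length, matching the normalization factor in the formula above; terms occurring in every document receive a zero weight because their IDF is log 1 = 0.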

