# TF–IDF

**DOI:** https://doi.org/10.1007/978-0-387-30164-8_832

TF–IDF (*term frequency–inverse document frequency*) is a term weighting scheme commonly used to represent textual documents as vectors (for purposes of classification, clustering, visualization, retrieval, etc.). Let *T* = {*t*_{1},…, *t*_{n}} be the set of all terms occurring in the document corpus under consideration. Then a document *d*_{i} is represented by an *n*-dimensional real-valued vector **x**_{i} = (*x*_{i1},…, *x*_{in}) with one component for each possible term from *T*.
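As a minimal sketch of this representation, the Python snippet below builds the term set *T* from a toy corpus and maps each document to an *n*-dimensional vector. The function names (`build_vocabulary`, `to_vector`), the pre-tokenized input, and the use of raw occurrence counts as placeholder component values are all assumptions made for illustration; they are not part of the definition above.

```python
def build_vocabulary(docs):
    """Return the sorted list of distinct terms (the set T) over all documents."""
    return sorted({term for doc in docs for term in doc})


def to_vector(doc, vocabulary):
    """Map one tokenized document to a vector with one component per term in T.

    Components here are raw occurrence counts, used only as a placeholder weight.
    """
    index = {term: j for j, term in enumerate(vocabulary)}
    vec = [0.0] * len(vocabulary)
    for term in doc:
        vec[index[term]] += 1.0
    return vec


# Hypothetical toy corpus; each document is already split into terms.
docs = [["the", "cat", "sat"], ["the", "dog", "sat", "sat"]]
vocabulary = build_vocabulary(docs)
vectors = [to_vector(doc, vocabulary) for doc in docs]
```

Every vector has the same length *n* = `len(vocabulary)`, regardless of how many terms the individual document contains, which is what makes the documents directly comparable.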

The weight *x*_{ij} corresponding to term *t*_{j} in document *d*_{i} is usually a product of three parts: one that depends on the presence or frequency of *t*_{j} in *d*_{i}, one that depends on *t*_{j}’s presence in the corpus as a whole, and a normalization part that depends on *d*_{i}. The most common TF–IDF weighting is defined by \(x_{ij} = \mathrm{TF}_{ij} \cdot \mathrm{IDF}_{j} \cdot \left(\sum_{k=1}^{n} (\mathrm{TF}_{ik}\,\mathrm{IDF}_{k})^{2}\right)^{-1/2}\), where the sum runs over all *n* terms, so that every nonzero document vector has unit Euclidean length.
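The three-part product can be sketched in Python as follows. This is an illustration under stated assumptions rather than a reference implementation: the text above does not fix the definitions of TF and IDF, so the sketch assumes TF is the raw occurrence count and IDF is log(*N* / DF), with *N* the number of documents and DF the number of documents containing the term. The function name `tfidf` and the toy count matrix are hypothetical.

```python
import math


def tfidf(counts):
    """Return L2-normalized TF-IDF weights.

    counts[i][j] is the number of occurrences of term j in document i
    (assumed here to be TF_ij).
    """
    n_docs = len(counts)
    n_terms = len(counts[0])

    # DF_j: number of documents in which term j occurs at least once.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

    # Assumed IDF_j = log(N / DF_j); a term absent from the corpus gets 0.
    idf = [math.log(n_docs / d) if d > 0 else 0.0 for d in df]

    weighted = []
    for row in counts:
        raw = [tf * w for tf, w in zip(row, idf)]
        # Normalization part: Euclidean length of the unnormalized vector.
        norm = math.sqrt(sum(v * v for v in raw))
        # An all-zero vector is left as zeros to avoid division by zero.
        weighted.append([v / norm if norm > 0 else 0.0 for v in raw])
    return weighted


# Hypothetical corpus of 3 documents over 3 terms.
counts = [[2, 1, 1], [0, 1, 0], [1, 1, 0]]
x = tfidf(counts)
```

Under the assumed IDF, the second term occurs in every document and therefore receives weight zero everywhere, so the second document, which contains only that term, maps to the zero vector.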