Keywords: Retrieval Model Feature Weight Calculation Length Normalization Factor Query Terms Inverse Document Frequency Factor
- Feature
A characteristic property of a document. Usually, a document’s terms are used as features, but virtually every measurable document property can be chosen, such as word classes, average sentence lengths, principal components of term-document-occurrence matrices, or term synonyms.
- Information need
Specifically here: A lack of information or knowledge that can be satisfied by a set of text documents.
- Query
Specifically here: A small set of terms that expresses a user’s information need.
- Relevance
The extent to which a document is capable of satisfying an information need. Within probabilistic retrieval models, relevance is modeled as a binary random variable.
A retrieval model provides a formal means to address (information) retrieval tasks with the aid of a computer.
A retrieval task is given if an information need is to be satisfied by exploiting an information resource. More specifically, the information need is represented as a term query provided by a user, the information resource is given in the form of a text document collection, and the solution of the retrieval task is the subset of documents in the collection that the user considers relevant with respect to the query. Though a broad range of retrieval tasks can be imagined, including all kinds of multimedia queries and multimedia collections (consider for example “query by humming” or medical image retrieval), the term “retrieval model” is predominantly used in the aforementioned narrow sense. Retrieval models in this sense are based on a linguistic theory and can be considered as heuristics that operationalize the probability ranking principle (Robertson 1997): “Given a query q, the ranking of documents according to their probabilities of being relevant to q leads to the optimum retrieval performance.” The principle cannot be applied to all kinds of retrieval tasks. In comment ranking, for example, the differential information gain must be considered.
Empirical models, sometimes referred to as vector space models, focus on the document representation (Salton and McGill 1983). Both documents and queries are considered as high-dimensional vectors in the Euclidean space, where a compatible representation is presumed: a particular document term or query term is always mapped to the same dimension, and the term importance is specified by a weight. Usually, the cosine of the angle between two such vectors, or simply their dot product, is used to quantify their similarity; in particular, the concept of similarity is equated with the concept of relevance. Empirical models can be distinguished with regard to the dimensions that are considered (i.e., the features that are chosen) and how these dimensions (features) are weighted.
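The vector space view can be illustrated with a minimal sketch. It assumes whitespace tokenization and raw term frequencies as weights; the toy documents and the helper names term_vector, dot, and cosine are hypothetical and not part of any particular system.

```python
import math
from collections import Counter


def term_vector(text):
    """Map a text to a sparse vector: term -> raw term frequency (assumed weighting)."""
    return Counter(text.lower().split())


def dot(u, v):
    """Dot product of two sparse vectors stored as dicts."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(weight * v.get(term, 0.0) for term, weight in u.items())


def cosine(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is empty."""
    norm_u = math.sqrt(dot(u, u))
    norm_v = math.sqrt(dot(v, v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot(u, v) / (norm_u * norm_v)


# Hypothetical toy collection; each term always maps to the same dimension.
docs = {
    "d1": "retrieval models rank documents",
    "d2": "language models generate text",
    "d3": "cooking recipes for pasta",
}
query = term_vector("retrieval models")

# Similarity is used directly as the relevance score.
ranking = sorted(docs, key=lambda d: cosine(query, term_vector(docs[d])), reverse=True)
print(ranking)
```

Replacing the raw frequencies in term_vector by another weighting scheme changes the model without touching the similarity computation, which mirrors the distinction between feature choice and feature weighting made above.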
Probabilistic models strive for an explicit modeling of the concept of relevance. Statistics comes into play in order to estimate the probability of the event that a document is relevant for a given information need. Most probabilistic models employ conditional probabilities to quantify document relevance given the occurrence of a term.
Language models are based on the idea of language generation as it is used in speech recognition systems. A language-based retrieval model is computed specifically for each document in a collection and is usually term-based. Given a query q, document ranking happens according to the generation probability of q under the language model of the respective document.
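As a rough sketch of this ranking idea, the following code scores documents by the log-probability of generating the query from a unigram model of each document. The mixing with a collection-wide model (parameter lam) is an assumption added here so that a missing query term does not zero out the score; it is not prescribed by the text above, and all names and the toy data are hypothetical.

```python
import math
from collections import Counter


def query_log_likelihood(query_terms, doc_terms, collection_tf, collection_len, lam=0.5):
    """log P(q | d) under a unigram model of d, mixed with the collection model.

    lam = 1.0 yields the pure maximum likelihood estimate tf(t, d) / |d|.
    """
    doc_tf = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for term in query_terms:
        p_doc = doc_tf[term] / doc_len if doc_len else 0.0
        p_coll = collection_tf[term] / collection_len
        p = lam * p_doc + (1.0 - lam) * p_coll
        if p == 0.0:
            return float("-inf")  # the query cannot be generated at all
        score += math.log(p)
    return score


# Hypothetical toy collection, already tokenized.
docs = {
    "d1": "retrieval models rank documents".split(),
    "d2": "language models generate text".split(),
    "d3": "cooking recipes for pasta".split(),
}
collection_tf = Counter(t for terms in docs.values() for t in terms)
collection_len = sum(collection_tf.values())

query = "language models".split()
scores = {
    d: query_log_likelihood(query, terms, collection_tf, collection_len)
    for d, terms in docs.items()
}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

With lam set to 1.0, any document that lacks a single query term receives a score of negative infinity, which is why some form of smoothing is commonly added in practice.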
The Boolean retrieval model uses binary term weights, and a query is a Boolean expression with terms as operands. Drawbacks of the Boolean model include its simplistic weighting scheme, its restriction to exact matches, and the fact that no document ranking is possible. The vector space model (VSM) and its variants consider documents and queries as embedded in the Euclidean space (see above). The key challenge for these kinds of models is the term weighting. Salton et al. (1975) proposed the tf · idf scheme, which combines the term frequency tf (the number of term occurrences in a document) with the inverse document frequency idf (based on the inverse of the number of documents that contain this term, usually log-scaled). The latent semantic indexing (LSI) model was developed to improve query interpretation and semantic-based matching (Deerwester et al. 1990). For example, a document d should match a query even if the user specified synonyms that do not occur in d. The LSI model attempts to achieve such effects by projecting documents and queries into a so-called “semantic space,” which is constructed by a singular value decomposition of the term-document matrix. The explicit semantic analysis (ESA) model was introduced to compute the semantic relatedness of natural language texts (Gabrilovich and Markovitch 2007). The model represents a document d as a high-dimensional vector whose dimensions quantify the pairwise similarities between d and the documents of some reference collection such as Wikipedia. Potthast et al. (2008) demonstrated how the ESA principles can be applied to develop an effective cross-language retrieval approach, the so-called CL-ESA model. In contrast to most retrieval models, the suffix tree model represents a document d not as a vector of index terms but as a compressed trie containing all suffixes (i.e., suffixes of all lengths) of the text of d. As a consequence, the collocation information of d is preserved, which may render the model superior for particular retrieval tasks (Meyer zu Eißen et al. 2005).
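The tf · idf weighting mentioned above can be sketched as follows. The sketch assumes one common variant, idf(t) = log(N / df(t)), where N is the collection size and df(t) the number of documents containing t; other variants exist, and the function name and toy collection are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return {doc_id: {term: tf * idf}} with idf(t) = log(N / df(t)) (assumed variant)."""
    n_docs = len(docs)
    tfs = {doc_id: Counter(terms) for doc_id, terms in docs.items()}
    # Document frequency: iterating a Counter yields each distinct term once per document.
    df = Counter(term for tf in tfs.values() for term in tf)
    idf = {term: math.log(n_docs / count) for term, count in df.items()}
    return {
        doc_id: {term: freq * idf[term] for term, freq in tf.items()}
        for doc_id, tf in tfs.items()
    }


# Hypothetical toy collection, already tokenized.
docs = {
    "d1": "retrieval models rank documents".split(),
    "d2": "language models generate text".split(),
    "d3": "cooking recipes for pasta".split(),
}
vectors = tfidf_vectors(docs)

# "models" occurs in two of three documents and is weighted lower than "retrieval".
print(vectors["d1"]["retrieval"], vectors["d1"]["models"])
```

A term that occurs in every document receives weight zero under this variant, which reflects the intuition that such a term does not help to discriminate between documents.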
Under the binary independence model (BIM), the documents are ranked by decreasing probability of relevance (Robertson and Sparck-Jones 1976). The model is based on two assumptions which allow for a practical estimation of the required probabilities: documents and queries are represented under a Boolean model, and the terms are modeled as occurring independently of each other. The best match (BM) model computes the relevance of a document to a query based on the frequencies of the query terms appearing in the document and their inverse document frequencies (Robertson and Walker 1994). Three parameters tune the influence of the document length, the document term frequency, and the query term frequency in the model. The best match model is among the most effective retrieval models in the Text Retrieval Conference (TREC) series.
The language modeling approach to information retrieval was proposed by Ponte and Croft (1998); the idea is to rank documents by the generation probabilities for a given query (see above). The algorithmic core of the model is a maximum likelihood estimation of the probability of a query term under a document’s term distribution. The latent Dirichlet allocation (LDA) model is a sophisticated generative model in the context of probabilistic topic modeling (Blei et al. 2003). Under this model it is assumed that documents are composed as a mixture of latent topics, where each topic is specified as a probability distribution over words. The mixture is generated by sampling from a Dirichlet distribution. More recently, Le and Mikolov (2014) introduced the paragraph vector model, also known as Doc2Vec, which learns continuous distributed vector representations for documents using a neural network classifier.
What distinguishes retrieval models from each other is the feature set that they employ for representing queries and documents, as well as the computation rule used to calculate the respective feature weights. In the following, the feature sets and the computation rules for four retrieval models are outlined, starting from the basic tf-model to the more sophisticated models tf · idf, BM25, and ESA.
For the two parameters of the BM25 weighting function, values of k1 ∈ [1.2, 2.0] and b = 0.75 are considered standard choices. The two normalization factors (length normalization and term frequency normalization) are balanced by the parameter b. The last factor of the formula is the BM25 variant of the inverse document frequency.
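To make the role of the two parameters concrete, here is a minimal sketch of a widely used two-parameter BM25 formulation. It is an assumption that this form matches the formula the text refers to: the query term frequency component is omitted, the inverse document frequency is taken as log((N - df + 0.5) / (df + 0.5)), and the function name and toy collection are hypothetical.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score every document in a non-empty collection against the query terms."""
    n_docs = len(docs)
    avg_len = sum(len(terms) for terms in docs.values()) / n_docs
    df = Counter(term for terms in docs.values() for term in set(terms))
    scores = {}
    for doc_id, terms in docs.items():
        tf = Counter(terms)
        # Length normalization: b = 0 disables it, b = 1 applies it fully.
        length_norm = 1.0 - b + b * len(terms) / avg_len
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            # Assumed idf variant; it becomes negative if df > N / 2.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency normalization: saturates as tf grows, controlled by k1.
            score += idf * tf[term] * (k1 + 1.0) / (tf[term] + k1 * length_norm)
        scores[doc_id] = score
    return scores


# Hypothetical toy collection, already tokenized; all documents have length 4.
docs = {
    "d1": "retrieval models rank documents".split(),
    "d2": "language models generate text".split(),
    "d3": "cooking recipes for pasta".split(),
    "d4": "weather report for monday".split(),
    "d5": "train schedule and tickets".split(),
}
scores = bm25_scores("retrieval models".split(), docs)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking[:2], scores["d1"])
```

Because all toy documents have the same length, the length normalization factor equals 1 here; varying the document lengths or the value of b shows how longer documents are penalized.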
The key application of retrieval models is to provide keyword-based search capabilities over large collections of natural language text such as digital libraries or the World Wide Web. In many practical settings, the documents of a collection are not completely unstructured but come with designated metadata such as document titles, abstracts, or markup in the text, as in the case of web pages. By taking this additional information into account, e.g., by boosting the relevance score of documents that contain the query terms in the title, the quality of the search can often be improved significantly over the use of standard retrieval models. In the field of Web search, probably the most prominent approach in this respect is the PageRank score (Page et al. 1999), which exploits the hyperlink graph of the Web for the relevance assessment of web pages. Note, however, that today the PageRank score is only one signal among several hundred other features.
Classical retrieval models provide the formal means of satisfying a user’s information need (typically a query) against a large document collection such as the Web. These models can be seen as heuristics that operationalize the probability ranking principle mentioned at the outset. Regarding future directions, a new generation of retrieval models may be capable of supporting information needs of the following kind: “Given a hypothesis, what is the document that provides the strongest arguments to support or attack the hypothesis?”
Obviously, the implied kind of relevance judgments cannot be made based on the classical retrieval models, as these models do not capture argument structure. In fact, so far the question of how to exploit argument structure for retrieval purposes has hardly been raised, but the research community has picked up this exciting direction (Gurevych et al. 2016).
- Gabrilovich E, Markovitch S (2007) Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In: IJCAI 2007, proceedings of the 20th international joint conference on artificial intelligence, Hyderabad, 6–12 Jan 2007, pp 1606–1611
- Le QV, Mikolov T (2014) Distributed representations of sentences and documents. In: Proceedings of the 31st international conference on machine learning (ICML 2014), Beijing, pp 1188–1196
- Meyer zu Eißen S, Stein B, Potthast M (2005) The suffix tree document model revisited. In: Tochtermann K, Maurer H (eds) 5th international conference on knowledge management (IKNOW 05), Know-Center, Graz. Journal of Universal Computer Science, pp 596–603
- Page L et al. (1999) The PageRank citation ranking: bringing order to the web
- Ponte J, Croft W (1998) A language modeling approach to information retrieval. In: SIGIR’98: proceedings of the 21st annual international ACM SIGIR conference on research and development in information retrieval. ACM Press, New York, pp 275–281. doi: 10.1145/290941.291008
- Potthast M, Stein B, Anderka M (2008) A Wikipedia-based multilingual retrieval model. In: Macdonald C, Ounis I, Plachouras V, Ruthven I, White R (eds) Advances in information retrieval. 30th European conference on IR research (ECIR 08). Lecture notes in computer science, vol 4956. Springer, Berlin/Heidelberg/New York, pp 522–530
- Robertson S (1997) The probability ranking principle in IR. Morgan Kaufmann Publishers, San Francisco
- Robertson S, Walker S (1994) Some simple effective approximations to the 2-poisson model for probabilistic weighted retrieval. In: SIGIR’94: proceedings of the 17th annual international ACM SIGIR conference on research and development in information retrieval. Springer, New York, pp 232–241