Abstract
A document is usually the highest linguistic unit of natural language. Document representation aims to encode the semantic information of the whole document into a real-valued representation vector, which could be further utilized in downstream tasks. Recently, document representation has become an essential task in natural language processing and has been widely used in many document-level real-world applications such as information retrieval and question answering. In this chapter, we first introduce the one-hot representation for documents. Next, we extensively introduce topic models that learn the topic distribution of words and documents. Further, we give an introduction to distributed document representation, including paragraph vector and neural document representations. Finally, we introduce several typical real-world applications of document representation, including information retrieval and question answering.
5.1 Introduction
Advances in information and communication technologies offer ubiquitous access to vast amounts of information and are causing an exponential increase in the number of documents available online. While more and more textual information is available electronically, effective retrieval and mining become more and more difficult without efficient organization, summarization, and indexing of document content. Therefore, document representation plays an important role in many real-world applications, e.g., document retrieval, web search, and spam filtering. Document representation aims to encode a document into a fixed-length vector that describes the contents of the document, in order to reduce the complexity of documents and make them easier to handle. Traditional document representation models such as one-hot document representation have achieved promising results in many document classification and clustering tasks due to their simplicity, efficiency, and often surprising accuracy.
However, the one-hot document representation model has many disadvantages. First, it loses the word order, and thus different documents can have the same representation as long as the same words are used. Second, it usually suffers from data sparsity and high dimensionality. The one-hot document representation model also has very little sense of the semantics of the words or, more formally, the distances between the words. Hence, an alternative approach represents text documents with multi-word terms as vector components, where the terms are noun phrases extracted using a combination of linguistic and statistical criteria. This representation is motivated by the notion, shared with topic models, that terms should carry more semantic information than individual words. Another advantage of using terms to represent a document is the lower dimensionality compared with the traditional one-hot document representation.
Nevertheless, applying these representations to generation tasks remains difficult. To understand how discourse units are connected, one has to understand the communicative function of each unit, and the role it plays within the context that encapsulates it, recursively all the way up for the entire text. Identifying increasingly sophisticated human-developed features may be insufficient for capturing these patterns, but developing representation-based alternatives has also been difficult. Although document representation can capture aspects of coherent sentence structure, it is not clear how it could help in generating more broadly cohesive text.
Recently, neural network models have shown compelling results in generating meaningful and grammatical documents in sequence generation tasks like machine translation or parsing. This is partially attributed to the ability of these systems to capture local compositionality: the way neighboring words are combined semantically and syntactically to form meanings that they wish to express. Based on neural network models, many research works have developed a variety of ways to incorporate document-level contextual information. These models are all hybrid architectures in that they are recurrent at the sentence level but use a different structure to summarize the context outside the sentence. Furthermore, some models explore multilevel recurrent architectures for combining local and global information in language modeling.
In this chapter, we first introduce the one-hot representation for documents. Next, we extensively introduce topic models that aim to learn latent topic distributions of words and documents. Further, we give an introduction to distributed document representation, including paragraph vector and neural document representations. Finally, we introduce several typical real-world applications of document representations, including information retrieval and question answering.
5.2 One-Hot Document Representation
The majority of machine learning algorithms take a fixed-length vector as input, so documents need to be represented as vectors. The bag-of-words model is the most common and simple representation method for documents. Similar to one-hot sentence representation, for a document \(d = \{w_1, w_2, \ldots , w_l\}\), a bag-of-words representation \(\mathbf {d}\) can be used to represent this document. Specifically, for a vocabulary \({V} = \{w_1, w_2, \dots , w_{|{V}|}\}\), the one-hot representation of word w is \( \mathbf {w} = [0, 0, \dots , 1, \dots , 0]\). Based on the one-hot word representation and the vocabulary V, a document can be represented as
$$\begin{aligned} \mathbf {d} = \sum _{i=1}^{l} \mathbf {w}_i , \end{aligned}$$
where l is the length of the document d. Similar to one-hot sentence representation, the TF-IDF method can also be applied to enhance the ability of the bag-of-words representation to reflect how important a word is to a document in a corpus.
Actually, the bag-of-words representation is mainly used as a tool for feature generation, and the most common type of feature calculated from this method is the frequency with which each word appears in the document. This method is simple but efficient and can sometimes reach excellent performance in many real-world applications. However, the bag-of-words representation entirely ignores the word order information, which means different documents can have the same representation as long as the same words are used. Furthermore, the bag-of-words representation has little sense of the semantics of the words or, more formally, the distances between words, which means this method cannot utilize the rich information hidden in word representations.
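As a concrete illustration, the following minimal sketch (plain Python with NumPy; the toy corpus and variable names are our own) builds bag-of-words count vectors and TF-IDF-weighted vectors for a small document collection.

```python
import numpy as np

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "dogs and cats are friends",
]

# Build the vocabulary from the whole corpus.
docs = [doc.split() for doc in corpus]
vocab = sorted({w for doc in docs for w in doc})
word2id = {w: i for i, w in enumerate(vocab)}

# Bag-of-words: each document vector is the sum of the one-hot vectors
# of its words, i.e., a vector of raw term frequencies.
bow = np.zeros((len(docs), len(vocab)))
for d, doc in enumerate(docs):
    for w in doc:
        bow[d, word2id[w]] += 1

# TF-IDF re-weights each term frequency by how rare the term is across
# the collection (idf = log(N / df)).
df = np.count_nonzero(bow, axis=0)          # document frequency of each word
idf = np.log(len(docs) / df)
tfidf = bow * idf

print("vocabulary:", vocab)
print("bag-of-words vector of doc 0:", bow[0])
print("TF-IDF vector of doc 0:", np.round(tfidf[0], 2))
```

Note how words shared by every document (e.g., "the") receive zero IDF weight, while rarer words dominate the TF-IDF vector.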
5.3 Topic Model
As our collective knowledge continues to be digitized and stored in the form of news, blogs, web pages, scientific articles, books, images, audio, videos, and social networks, it becomes more difficult to find and discover what we are looking for. We need new computational tools to help organize, search, and understand these vast amounts of information.
Right now, we work with online information using two main tools—search and links. We type keywords into a search engine and find a set of documents related to them. We look at the documents in that set, possibly navigating to other linked documents. This is a powerful way of interacting with our online archive, but something is missing.
Imagine searching and exploring documents based on the themes that run through them. We might “zoom in” and “zoom out” to find specific or broader themes; we might look at how those themes changed through time or how they are connected. Rather than finding documents through keyword search alone, we might first find the theme that we are interested in, and then examine the documents related to that theme.
For example, consider using themes to explore the complete history of the New York Times. At a broad level, some of the themes might correspond to the sections of the newspaper, such as foreign policy, national affairs, and sports. We could zoom in on a theme of interest, such as foreign policy, to reveal various aspects of it, such as Chinese foreign policy, the conflict in the Middle East, and the United States’ relationship with Russia. We could then navigate through time to reveal how these specific themes have changed, tracking, for example, the changes in the conflict in the Middle East over the last 50 years. And, in all of this exploration, we would be pointed to the original articles relevant to the themes. The thematic structure would be a new kind of window through which to explore and digest the collection.
But we do not interact with electronic archives in this way. While more and more texts are available online, we do not have the human power to read and study them to provide the kind of browsing experience described above. To this end, machine learning researchers have developed probabilistic topic modeling, a suite of algorithms that aim to discover and annotate vast archives of documents with thematic information. Topic modeling algorithms are statistical methods that analyze the words of the original texts to explore the themes that run through them, how those themes are connected, and how they change over time. Topic modeling algorithms do not require any prior annotations or labeling of the documents. The topics emerge from the analysis of the original texts. Topic modeling enables us to organize and summarize electronic archives at a scale that would be impossible by human annotation.
5.3.1 Latent Dirichlet Allocation
A variety of probabilistic topic models have been used to analyze the content of documents and the meaning of words. Hofmann first introduced the probabilistic topic approach to document modeling in his Probabilistic Latent Semantic Indexing (pLSI) method. The pLSI model does not make any assumptions about how the mixture weights are generated, making it difficult to test the generalization ability of the model to new documents. Thus, Latent Dirichlet Allocation (LDA) was extended from this model by introducing a Dirichlet prior on the per-document topic proportions. LDA is regarded as a simple but effective topic model. We first describe the basic ideas of LDA [6].
The intuition behind LDA is that documents exhibit multiple topics. LDA is a statistical model of document collections that tries to capture this intuition. It is most easily described by its generative process, the imaginary random process by which the model assumes the documents arose.
We formally define a topic to be a distribution over a fixed vocabulary. We assume that these topics are specified before any data has been generated. Now for each document in the collection, we generate the words in a twostage process.

1. Randomly choose a distribution over topics.
2. For each word in the document,
(a) Randomly choose a topic from the distribution over topics in step #1.
(b) Randomly choose a word from the corresponding distribution over the vocabulary.

This statistical model reflects the intuition that documents exhibit multiple topics. Each document exhibits the topics with different proportions (step #1); each word in each document is drawn from one of the topics (step #2b), where the selected topic is chosen from the perdocument distribution over topics (step #2a).
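To make the generative process concrete, the following sketch simulates it with NumPy (the vocabulary size, number of topics, document length, and Dirichlet hyperparameters are illustrative assumptions): for each document, a topic distribution is drawn first, and then a topic and a word are drawn for every token.

```python
import numpy as np

rng = np.random.default_rng(0)

V, K, n_docs, doc_len = 1000, 5, 3, 20   # vocabulary size, topics, documents, words per doc
alpha, eta = 0.1, 0.01                   # Dirichlet hyperparameters (illustrative values)

# Topics: each topic beta_k is a distribution over the fixed vocabulary,
# assumed to be specified before any document is generated.
beta = rng.dirichlet(np.full(V, eta), size=K)          # shape (K, V)

documents = []
for _ in range(n_docs):
    theta = rng.dirichlet(np.full(K, alpha))           # step 1: per-document topic proportions
    words = []
    for _ in range(doc_len):
        z = rng.choice(K, p=theta)                     # step 2a: choose a topic
        w = rng.choice(V, p=beta[z])                   # step 2b: choose a word from that topic
        words.append(w)
    documents.append(words)

print(documents[0])   # word ids of the first simulated document
```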
We emphasize that the algorithms have no prior information about the subjects of these topics and that the articles are not labeled with topics or keywords. The interpretable topic distributions arise by computing the hidden structure that likely generated the observed collection of documents.
5.3.1.1 LDA and Probabilistic Models
LDA and other topic models are part of the broader field of probabilistic modeling. In generative probabilistic modeling, we treat our data as arising from a generative process that includes hidden variables. This generative process defines a joint probability distribution over both the observed and hidden random variables. Given the observed variables, we perform data analysis by using that joint distribution to compute the conditional distribution of the hidden variables. This conditional distribution is also called the posterior distribution.
LDA falls precisely into this framework. The observed variables are the words of the documents, the hidden variables are the topic structure, and the generative process is as described above. The computational problem of inferring the hidden topic structure from the documents is the problem of computing the posterior distribution, the conditional distribution of the hidden variables given the documents.
We can describe LDA more formally with the following notation. The topics are \(\beta _{1:K}\), where each \(\beta _k\) is a distribution over the vocabulary. The topic proportions for the dth document are \(\theta _d\), where \(\theta _{dk}\) is the topic proportion for topic k in document d. The topic assignments for the dth document are \(z_d\), where \(z_{d,n}\) is the topic assignment for the nth word in document d. Finally, the observed words for document d are \(w_d\), where \(w_{d,n}\) is the nth word in document d, which is an element from the fixed vocabulary.
With this notation, the generative process for LDA corresponds to the following joint distribution of the hidden and observed variables:
$$\begin{aligned} P(\beta _{1:K}, \theta _{1:D}, z_{1:D}, w_{1:D}) = \prod _{k=1}^{K} P(\beta _k) \prod _{d=1}^{D} P(\theta _d) \left( \prod _{n=1}^{N} P(z_{d,n} \mid \theta _d) \, P(w_{d,n} \mid \beta _{1:K}, z_{d,n}) \right) . \end{aligned}$$
Notice that this distribution specifies a number of dependencies. For example, the topic assignment \(z_{d,n}\) depends on the per-document topic proportions \(\theta _d\). As another example, the observed word \(w_{d,n}\) depends on the topic assignment \(z_{d,n}\) and all of the topics \(\beta _{1:K}\).
These dependencies define LDA. They are encoded in the statistical assumptions behind the generative process, in the particular mathematical form of the joint distribution, and in a third way, in the probabilistic graphical model for LDA. Probabilistic graphical models provide a graphical language for describing families of probability distributions. The graphical model for LDA is in Fig. 5.1. Each node is a random variable and is labeled according to its role in the generative process. The hidden nodes, i.e., the topic proportions, assignments, and topics, are unshaded. The observed nodes, i.e., the words of the documents, are shaded. We use rectangles as plate notation to denote replication. The N plate denotes the collection of words within documents; the D plate denotes the collection of documents within the collection. These three representations are equivalent ways of describing the probabilistic assumptions behind LDA.
5.3.1.2 Posterior Computation for LDA
We now turn to the computational problem, computing the conditional distribution of the topic structure given the observed documents. (As we mentioned above, this is called the posterior.) Using our notation, the posterior is
$$\begin{aligned} P(\beta _{1:K}, \theta _{1:D}, z_{1:D} \mid w_{1:D}) = \frac{P(\beta _{1:K}, \theta _{1:D}, z_{1:D}, w_{1:D})}{P(w_{1:D})} . \end{aligned}$$
The numerator is the joint distribution of all the random variables, which can be easily computed for any setting of the hidden variables. The denominator is the marginal probability of the observations, which is the probability of seeing the observed corpus under any topic model. In theory, it can be computed by summing the joint distribution over every possible instantiation of the hidden topic structure.
Topic modeling algorithms form an approximation of the above equation by forming an alternative distribution over the latent topic structure that is adapted to be close to the true posterior. Topic modeling algorithms generally fall into two categories: samplingbased algorithms and variational algorithms.
Sampling-based algorithms attempt to collect samples from the posterior by approximating it with an empirical distribution. The most commonly used sampling algorithm for topic modeling is Gibbs sampling, where we construct a Markov chain (a sequence of random variables, each dependent on the previous one) whose limiting distribution is the posterior. The Markov chain is defined on the hidden topic variables for a particular corpus, and the algorithm is to run the chain for a long time, collect samples from the limiting distribution, and then approximate the posterior with the collected samples.
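As an illustration of the sampling-based route, the sketch below implements a bare-bones collapsed Gibbs sampler for LDA (NumPy; the toy corpus, symmetric hyperparameters, and iteration count are illustrative assumptions): each token's topic is resampled from its conditional distribution given all other assignments, and the collected samples approximate the posterior.

```python
import numpy as np

rng = np.random.default_rng(0)
docs = [[0, 1, 2, 1], [2, 3, 3, 4], [0, 4, 4, 1]]   # toy corpus of word ids
V, K, alpha, beta = 5, 2, 0.1, 0.01
n_iters = 200

# Count matrices: document-topic (C_d), word-topic (C_w), topic totals (C_k).
C_d = np.zeros((len(docs), K))
C_w = np.zeros((V, K))
C_k = np.zeros(K)
z = []                                               # topic assignment of every token
for d, doc in enumerate(docs):
    z.append([])
    for w in doc:
        k = rng.integers(K)                          # random initialization
        z[d].append(k)
        C_d[d, k] += 1; C_w[w, k] += 1; C_k[k] += 1

for _ in range(n_iters):
    for d, doc in enumerate(docs):
        for n, w in enumerate(doc):
            k = z[d][n]
            # Remove the current assignment before resampling (the "collapsed" step).
            C_d[d, k] -= 1; C_w[w, k] -= 1; C_k[k] -= 1
            # Conditional distribution over topics for this token.
            p = (C_d[d] + alpha) * (C_w[w] + beta) / (C_k + V * beta)
            k = rng.choice(K, p=p / p.sum())
            z[d][n] = k
            C_d[d, k] += 1; C_w[w, k] += 1; C_k[k] += 1

print("estimated topic proportions of doc 0:", (C_d[0] + alpha) / (C_d[0] + alpha).sum())
```

Note that the counts are updated immediately after every token is resampled; this is exactly the property that the MCEM algorithm in Sect. 5.3.2.3 relaxes.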
Variational methods are a deterministic alternative to samplingbased algorithms. Rather than approximating the posterior with samples, variational methods posit a parameterized family of distributions over the hidden structure and then find the member of that family that is closest to the posterior. Thus, the inference problem is transformed into an optimization problem. Variational methods open the door for innovations in optimization to have a practical impact on probabilistic modeling.
5.3.2 Extensions
The simple LDA model provides a powerful tool for discovering and exploiting the hidden thematic structure in large archives of text. However, one of the main advantages of formulating LDA as a probabilistic model is that it can easily be used as a module in more complicated models for more complex goals. Since its introduction, LDA has been extended and adapted in many ways.
5.3.2.1 Relaxing the Assumptions of LDA
LDA is defined by the statistical assumptions it makes about the corpus. One active area of topic modeling research is how to relax and extend these assumptions to uncover a more sophisticated structure in the texts.
One assumption that LDA makes is the bag-of-words assumption that the order of the words in the document does not matter. While this assumption is unrealistic, it is reasonable if our only goal is to uncover the coarse semantic structure of the texts. For more sophisticated goals, such as language generation, it is patently not appropriate. There have been many extensions to LDA that model words non-exchangeably. For example, [59] develops a topic model that relaxes the bag-of-words assumption by assuming that the topics generate words conditional on the previous word; [22] develops a topic model that switches between LDA and a standard HMM. These models expand the parameter space significantly but show improved language modeling performance.
Another assumption is that the order of documents does not matter. Again, this can be seen by noticing that Eq. 5.3 remains invariant to permutations of the ordering of documents in the collection. This assumption may be unrealistic when analyzing long-running collections that span years or centuries. In such collections, we may want to assume that the topics change over time. One approach to this problem is the dynamic topic model [5], a model that respects the ordering of the documents and gives a richer posterior topical structure than LDA.
A third assumption of LDA is that the number of topics is known and fixed. The Bayesian nonparametric topic model provides an elegant solution: the collection determines the number of topics during posterior inference, and new documents can exhibit previously unseen topics. Bayesian nonparametric topic models have been extended to hierarchies of topics, which find a tree of topics, moving from more general to more concrete, whose particular structure is inferred from the data [4].
5.3.2.2 Incorporating Metadata into LDA
In many text analysis settings, the documents contain additional information such as author, title, geographic location, links, and others that we might want to account for when fitting a topic model. There has been a flurry of research on adapting topic models to include metadata.
The author-topic model [51] is an early success story for this kind of research. The topic proportions are attached to authors; papers with multiple authors are assumed to attach each word to an author, with the word drawn from a topic that is itself drawn from that author's topic proportions. The author-topic model allows for inferences about authors as well as documents.
Many document collections are linked. For example, scientific papers are linked by citations, or web pages are connected by hyperlinks. And several topic models have been developed to account for those links when estimating the topics. The relational topic model of [9] assumes that each document is modeled as in LDA and that the links between documents depend on the distance between their topic proportions. This is both a new topic model and a new network model. Unlike traditional statistical models of networks, the relational topic model takes into account node attributes in modeling the links.
Other work that incorporates metadata into topic models includes models of linguistic structure [8], models that account for distances between corpora [60], and models of named entities [42]. General-purpose methods for incorporating metadata into topic models include Dirichlet-multinomial regression models [39] and supervised topic models [37].
5.3.2.3 Acceleration
Many algorithms have been proposed to accelerate LDA based on the collapsed Gibbs sampling (CGS) update. In these existing fast algorithms, it is difficult to decouple the accesses to the document-topic counts \( C_{d}\) and the word-topic counts \( C_{w}\), because both counts need to be updated instantly after the sampling of every token. WarpLDA [13] is instead built on a Monte Carlo Expectation Maximization (MCEM) algorithm, which is similar to CGS, but both counts are fixed until the sampling of all tokens is finished. This scheme can be used to develop a reordering strategy that decouples the accesses to \( C_d\) and \( C_w\) and minimizes the size of randomly accessed memory.
Specifically, WarpLDA seeks a MAP solution of the latent variables \( \varTheta \) and \( \varPhi \), with the latent topic assignments Z integrated out, by maximizing \(\log P(\varTheta , \varPhi \mid W, \alpha ', \beta ')\), where \(\alpha ^\prime \) and \(\beta ^\prime \) are the Dirichlet hyperparameters. Reference [2] has shown that this MAP solution is almost identical to the solution of CGS with proper hyperparameters.
Computing \(\log P(\varTheta , \varPhi \mid W, \alpha ', \beta ')\) directly is expensive because it needs to enumerate all the K possible topic assignments for each token. We therefore optimize its lower bound as a surrogate. Let Q(Z) be a variational distribution. Then, by Jensen’s inequality, the lower bound can be written as \(\mathscr {J}( \varTheta , \varPhi , Q( Z))\).
An Expectation Maximization (EM) algorithm is implemented to find a local maximum of the posterior \(P( \varTheta , \varPhi \mid W, \alpha ^{\prime }, \beta ^{\prime })\), where the E-step maximizes \(\mathscr {J}\) with respect to the variational distribution Q(Z) and the M-step maximizes \(\mathscr {J}\) with respect to the model parameters \(( \varTheta , \varPhi )\), while keeping Q(Z) fixed. One can prove that the optimal solution at the E-step is \(Q( Z) = P( Z \mid W, \varTheta , \varPhi )\) without further assumptions on Q. We apply Monte Carlo approximation to the expectation in Eq. 5.4,
where \( Z^{(1)}, \dots , Z^{(S)} \sim Q( Z)=P( Z \mid W, \varTheta , \varPhi )\). The sample size is set as \(S=1\) and the model uses Z as an abbreviation of \( Z^{(1)}\).
Sampling Z: Each dimension of Z can be sampled independently:
Optimizing \( \varTheta , \varPhi \): With the Monte Carlo approximation, we have
and with the optimal solutions, we have
Instead of computing and storing \(\hat{ \varTheta }\) and \(\hat{ \varPhi }\), we compute and store \( C_{d}\) and \( C_{w}\) to save memory because the latter are sparse. Plugging Eq. 5.8 into Eq. 5.6 and letting \( \alpha = \alpha ^\prime - 1, \beta =\beta ^\prime - 1\), we get the full MCEM algorithm, which iteratively performs the following two steps until a given iteration number is reached:

E-step: We can sample \(z_{d, n}\sim Q(z_{d, n}=k)\) according to
$$\begin{aligned} Q(z_{d, n}=k)\propto (C_{dk} + \alpha _k)\frac{C_{wk} + \beta _w}{C_k + \bar{\beta }} . \end{aligned}$$(5.9)
M-step: Compute \( C_{d}\) and \( C_{w}\) from Z.
Note that this resemblance intuitively justifies why MCEM leads to results similar to CGS. The difference between MCEM and CGS is that MCEM updates the counts \( C_d\) and \( C_w\) after sampling all \(z_{d, n}\)s, while CGS updates the counts instantly after sampling each \(z_{d, n}\). The strategy of updating the counts after sampling all \(z_{d, n}\)s is called delayed count update, or simply delayed update. MCEM can thus be viewed as CGS with a delayed update, which has been widely used in other algorithms [1, 41]. While previous work uses the delayed update as a trick, here it comes with a theoretical guarantee of convergence to a MAP solution. The delayed update is essential for decoupling the accesses to \( C_d\) and \( C_w\) to improve cache locality, without affecting correctness.
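The sketch below contrasts the delayed count update of MCEM with the instant update of CGS (NumPy; the toy corpus and hyperparameters are illustrative assumptions). In the E-step, every token's topic is sampled from Eq. 5.9 using counts that stay frozen for the whole sweep; only in the M-step are \(C_d\) and \(C_w\) rebuilt from the new assignments.

```python
import numpy as np

rng = np.random.default_rng(0)
docs = [[0, 1, 2, 1], [2, 3, 3, 4], [0, 4, 4, 1]]      # toy corpus of word ids
V, K, alpha, beta = 5, 2, 0.1, 0.01

# Random initial topic assignments Z.
Z = [[rng.integers(K) for _ in doc] for doc in docs]

def build_counts(Z):
    """M-step: rebuild C_d, C_w (and topic totals C_k) from Z."""
    C_d = np.zeros((len(docs), K)); C_w = np.zeros((V, K)); C_k = np.zeros(K)
    for d, doc in enumerate(docs):
        for n, w in enumerate(doc):
            k = Z[d][n]
            C_d[d, k] += 1; C_w[w, k] += 1; C_k[k] += 1
    return C_d, C_w, C_k

for _ in range(100):
    C_d, C_w, C_k = build_counts(Z)                    # counts stay fixed during the E-step
    for d, doc in enumerate(docs):
        for n, w in enumerate(doc):
            # E-step (Eq. 5.9): Q(z = k) proportional to (C_dk + alpha)(C_wk + beta)/(C_k + V*beta),
            # evaluated with the *delayed* counts, not counts updated per token.
            p = (C_d[d] + alpha) * (C_w[w] + beta) / (C_k + V * beta)
            Z[d][n] = rng.choice(K, p=p / p.sum())

C_d, _, _ = build_counts(Z)
print("document-topic counts:\n", C_d)
```

Because the counts are read-only inside the E-step, the accesses to \(C_d\) and \(C_w\) can be reordered and batched, which is what enables WarpLDA's cache-friendly implementation.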
5.4 Distributed Document Representation
To address the disadvantages of the bag-of-words document representation, [31] proposes paragraph vector models, including the Distributed Memory version (PV-DM) and the Distributed Bag-of-Words version (PV-DBOW). Moreover, researchers have also proposed several hierarchical neural network models to represent documents. In this section, we will introduce these models in detail.
5.4.1 Paragraph Vector
As shown in Fig. 5.2, the paragraph vector model maps every paragraph to a unique vector, represented by a column in the matrix \(\mathbf {P}\), and maps every word to a unique vector, represented by a column in the word embedding matrix \(\mathbf {E}\). The paragraph vector and word vectors are averaged or concatenated to predict the next word in a context. More formally, compared to the word vector framework, the only change in this model is that the hidden representation h is constructed by concatenating or averaging vectors extracted from both \(\mathbf {E}\) and \(\mathbf {P}\), rather than from \(\mathbf {E}\) alone.
Given a sequence of training words \(w_1\), \(w_2\), \(w_3\), ..., \(w_l\), the objective of the paragraph vector model, as in the word vector framework, is to maximize the average log probability:
$$\begin{aligned} \frac{1}{l}\sum _{t=k}^{l-k} \log P(w_t \mid w_{t-k}, \ldots , w_{t+k}) . \end{aligned}$$
The prediction task is typically done via a multi-class classifier, such as softmax. Thus, the probability is computed as
$$\begin{aligned} P(w_t \mid w_{t-k}, \ldots , w_{t+k}) = \frac{\exp (y_{w_t})}{\sum _{i} \exp (y_i)} , \end{aligned}$$
where each \(y_i\) is the unnormalized log probability of output word i, computed from the hidden representation h.
The paragraph token can be thought of as another word. It acts as a memory that remembers what is missing from the current context, or the topic of the paragraph. For this reason, this model is often called the Distributed Memory Model of Paragraph Vectors (PV-DM).
The above method considers the concatenation of the paragraph vector with the word vectors to predict the next word in a text window. Another way is to ignore the context words in the input but force the model to predict words randomly sampled from the paragraph in the output. In reality, what this means is that at each iteration of stochastic gradient descent, we sample a text window, then sample a random word from the text window, and form a classification task given the paragraph vector. This technique is shown in Fig. 5.3. This version is named the Distributed Bag-of-Words version of Paragraph Vector (PV-DBOW), as opposed to the Distributed Memory version of Paragraph Vector (PV-DM) described above.
In addition to being conceptually simple, this model needs to store less data: only the softmax weights need to be stored, as opposed to both the softmax weights and the word vectors in the previous model. This model is also similar to the Skip-gram model for word vectors.
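In practice, both variants are available in off-the-shelf toolkits. The sketch below uses gensim's Doc2Vec implementation as a minimal example (assuming gensim ≥ 4 is installed; the toy corpus and hyperparameter values are illustrative), where dm=1 selects PV-DM and dm=0 selects PV-DBOW.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "dogs and cats are friends",
]
# Each document is tagged with an id; the tag indexes its paragraph vector.
documents = [TaggedDocument(words=text.split(), tags=[i]) for i, text in enumerate(corpus)]

# dm=1 -> PV-DM (distributed memory); dm=0 -> PV-DBOW (distributed bag-of-words).
pv_dm = Doc2Vec(documents, vector_size=50, window=2, min_count=1, dm=1, epochs=40)
pv_dbow = Doc2Vec(documents, vector_size=50, window=2, min_count=1, dm=0, epochs=40)

# Paragraph vector of training document 0, and an inferred vector for unseen text.
print(pv_dm.dv[0][:5])
print(pv_dbow.infer_vector("a cat and a dog".split())[:5])
```

(In gensim versions before 4.0, the trained paragraph vectors are accessed through model.docvecs rather than model.dv.)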
5.4.2 Neural Document Representation
In this part, we introduce two main kinds of neural networks for document representation, including the document-context language model and the hierarchical document autoencoder.
5.4.2.1 Document-Context Language Model
Recurrent architectures can be used to combine local and global information in document language modeling. The simplest such model would be to train a single RNN, ignoring sentence boundaries as mentioned above; the last hidden state from the previous sentence \(t-1\) is used to initialize the first hidden state in sentence t. In such an architecture, the length of the RNN equals the number of tokens in the document; in typical genres such as news texts, this means training RNNs on sequences of several hundred tokens, which introduces two problems: (1) Information decay. In a sentence with thirty tokens (not unusual in news text), the contextual information from the previous sentence must be propagated through the recurrent dynamics thirty times before it can reach the last token of the current sentence. Meaningful document-level information is unlikely to survive such a long pipeline. (2) Learning. It is notoriously difficult to train recurrent architectures that involve many time steps. In the case of an RNN trained on an entire document, backpropagation would have to run over hundreds of steps, posing severe numerical challenges.
To address these two issues, [28] proposes to use multilevel recurrent structures to represent documents, thereby efficiently leveraging document-level context in language modeling. They first proposed the Context-to-Context Document-Context Language Model (ccDCLM), which assumes that contextual information from previous sentences needs to be able to “short-circuit” the standard RNN, so as to more directly impact the generation of words across longer spans of text. Formally, we have
where l is the length of sentence \(t-1\). The ccDCLM model then creates additional paths for this information to impact each hidden representation in the current sentence t. Writing \(\mathbf {w}_{t,n}\) for the word representation of the nth word in the tth sentence, we have
where \({g}_{\theta }(\cdot )\) is the activation function parameterized by \(\theta \) and \({f}(\cdot )\) is a function that combines the context vector with the input \(\mathbf {x}_{t,n}\) for the hidden state. Here we simply concatenate the representations,
The emission probability for \(\mathbf {y}_{t,n}\) is then computed from \(\mathbf {h}_{t,n}\) as in the standard RNNLM. The underlying assumption of this model is that contextual information should impact the generation of each word in the current sentence. The model, therefore, introduces computational “short-circuits” for cross-sentence information, as illustrated in Fig. 5.4.
Besides, they also proposed the Context-to-Output Document-Context Language Model (coDCLM). Rather than incorporating the document context into the recurrent definition of the hidden state, the coDCLM model pushes it directly to the output, as illustrated in Fig. 5.5. Let \(\mathbf {h}_{t,n}\) be the hidden state from a conventional RNNLM of sentence t,
Then, the context vector \(\mathbf {c}_{t-1}\) is directly used in the output layer, so that the emission probability of \(\mathbf {y}_{t,n}\) is computed from both \(\mathbf {h}_{t,n}\) and \(\mathbf {c}_{t-1}\).
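A minimal PyTorch-style sketch of the two variants is given below (our own illustration with assumed dimensions and module names, not the authors' released code): the context vector \(\mathbf{c}_{t-1}\) is the last hidden state of the previous sentence; ccDCLM concatenates it with every word embedding of the current sentence, while coDCLM instead adds its contribution directly to the output logits.

```python
import torch
import torch.nn as nn

class DCLM(nn.Module):
    """Toy document-context language model; mode is 'cc' (ccDCLM) or 'co' (coDCLM)."""
    def __init__(self, vocab_size, emb_dim=64, hid_dim=128, mode="cc"):
        super().__init__()
        self.mode = mode
        self.emb = nn.Embedding(vocab_size, emb_dim)
        in_dim = emb_dim + hid_dim if mode == "cc" else emb_dim
        self.rnn = nn.GRU(in_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)
        if mode == "co":
            self.ctx_out = nn.Linear(hid_dim, vocab_size, bias=False)

    def forward(self, sent, context):
        # sent: (batch, n) word ids of sentence t; context: (batch, hid_dim) = c_{t-1}.
        x = self.emb(sent)
        if self.mode == "cc":
            # ccDCLM: concatenate c_{t-1} with every input embedding ("short-circuit").
            ctx = context.unsqueeze(1).expand(-1, x.size(1), -1)
            x = torch.cat([x, ctx], dim=-1)
        h, last = self.rnn(x)
        logits = self.out(h)
        if self.mode == "co":
            # coDCLM: push the context directly into the output layer.
            logits = logits + self.ctx_out(context).unsqueeze(1)
        # Return logits and the new context c_t (last hidden state of this sentence).
        return logits, last.squeeze(0)

model = DCLM(vocab_size=100, mode="cc")
sent = torch.randint(0, 100, (2, 7))          # batch of 2 sentences, 7 tokens each
c_prev = torch.zeros(2, 128)                  # context from sentence t-1
logits, c_t = model(sent, c_prev)
print(logits.shape, c_t.shape)                # (2, 7, 100) (2, 128)
```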
5.4.2.2 Hierarchical Document Autoencoder
Reference [33] also proposes hierarchical document autoencoder to represent documents. The model draws on the intuition that just as the juxtaposition of words creates a joint meaning of a sentence, the juxtaposition of sentences also creates a joint meaning of a paragraph or a document.
They first obtain representation vectors at the sentence level by putting one layer of LSTM (denoted as \({\text {LSTM}}_{{encode}}^{{word}}\)) on top of its containing words:
The vector output at the ending time step is used to represent the entire sentence as
To build representation \(e_D\) for the current document/paragraph, another layer of LSTM (denoted as \({\text {LSTM}}_{{encode}}^{{sentence}}\)) is placed on top of all sentences, computing representations sequentially for each time step:
Representation \(h_{{end_D}}^s\) computed at the final time step is used to represent the entire document: \(\mathbf {d}=h_{{end_D}}^s\).
Thus one LSTM operates at the token level, leading to the acquisition of sentencelevel representations that are then used as inputs into the second LSTM that acquires documentlevel representations, in a hierarchical structure.
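A simplified sketch of this two-level encoder is shown below (PyTorch; the dimensions and class name are illustrative assumptions rather than the authors' implementation): a word-level LSTM turns each sentence into a vector, and a sentence-level LSTM runs over those vectors to produce the document representation.

```python
import torch
import torch.nn as nn

class HierarchicalEncoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, hid_dim=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.word_lstm = nn.LSTM(emb_dim, hid_dim, batch_first=True)      # LSTM_encode^word
        self.sent_lstm = nn.LSTM(hid_dim, hid_dim, batch_first=True)      # LSTM_encode^sentence

    def forward(self, document):
        # document: (n_sentences, n_tokens) word ids of one document.
        sent_vecs = []
        for sent in document:
            _, (h, _) = self.word_lstm(self.emb(sent).unsqueeze(0))
            sent_vecs.append(h[-1])                      # last hidden state represents the sentence
        sent_seq = torch.stack(sent_vecs, dim=1)         # (1, n_sentences, hid_dim)
        _, (h, _) = self.sent_lstm(sent_seq)
        return h[-1].squeeze(0)                          # document vector d = final hidden state

encoder = HierarchicalEncoder(vocab_size=100)
doc = torch.randint(0, 100, (4, 9))                      # a document with 4 sentences of 9 tokens
print(encoder(doc).shape)                                # torch.Size([128])
```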
As with encoding, the decoding algorithm operates on a hierarchical structure with two layers of LSTMs. LSTM outputs at sentence level for time step t are obtained by
At the initial time step, \(h_0^s(d)=e_D\), the end-to-end output from the encoding procedure. \(h_{t}^s(d)\) is used as the original input into \({\text {LSTM}}_{{decode}}^{word}\) for subsequently predicting tokens within sentence \(t+1\). \({\text {LSTM}}_{{decode}}^{word}\) predicts tokens at each position sequentially; the embedding of each predicted token is then combined with earlier hidden vectors for the next time-step prediction until the \(end_s\) token is predicted. The procedure can be summarized as follows:
During decoding, \({\text {LSTM}}_{{decode}}^{word}\) generates each word token w sequentially and combines it with the earlier LSTM-output hidden vectors. The LSTM hidden vector computed at the final time step is used to represent the current sentence.
This is passed to \({\text {LSTM}}_{{decode}}^{sentence}\), combined with \(h_{t}^s\) for the acquisition of \(h_{t+1}\), and outputted to the next time step in sentence decoding. For each time step t, \({\text {LSTM}}_{{decode}}^{sentence}\) has to first decide whether decoding should proceed or come to a full stop: we add an additional token \({end}_D\) to the vocabulary. Decoding terminates when token \({end}_D\) is predicted. Details are shown in Fig. 5.6.
Attention models adopt a lookback strategy by linking the current decoding stage with input sentences in an attempt to consider which part of the input is most responsible for the current decoding state (Fig. 5.7).
Let \(H=\{h_1^s(e), h_2^s(e), \ldots , h^s_{N}(e)\}\) be the collection of sentence-level hidden vectors for each sentence from the inputs, outputted from \({\text {LSTM}}_{{encode}}^{{sentence}}\). Each element in H contains information about input sequences with a strong focus on the parts surrounding each specific sentence (time step). During decoding, suppose that \(e_{t}^s\) denotes the sentence-level embedding at the current step and that \(h_{t-1}^s(\text {dec})\) denotes the hidden vector outputted from \({\text {LSTM}}_{decode}^{sentence}\) at the previous time step \(t-1\). Attention models first link the current-step decoding information, i.e., \(h_{t-1}^s(\text {dec})\), with each of the input sentences \(i\in [1, N]\), characterized by a strength indicator \(v_i\):
where \(\mathbf {W}_1, \mathbf {W}_2\in \mathbb {R}^{K\times K}\), \(\mathbf {U}\in \mathbb {R}^{K\times 1}\). \(v_i\) is then normalized
The attention vector is then created by averaging the sentence-level hidden vectors of all input sentences, weighted by the normalized strengths.
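The following sketch implements this sentence-level attention step (NumPy; the matrix sizes and random initialization are illustrative assumptions): a strength \(v_i\) is computed for every encoder sentence vector against the current decoder state, the strengths are normalized with a softmax, and the encoder vectors are averaged into a context vector.

```python
import numpy as np

rng = np.random.default_rng(0)
K, N = 16, 5                           # hidden size and number of input sentences

H = rng.normal(size=(N, K))            # sentence-level encoder states h_1..h_N
h_dec = rng.normal(size=K)             # decoder state h_{t-1}(dec)
W1, W2 = rng.normal(size=(K, K)), rng.normal(size=(K, K))
U = rng.normal(size=K)

# Strength indicator v_i = U^T tanh(W1 h_dec + W2 h_i) for every input sentence.
v = np.array([U @ np.tanh(W1 @ h_dec + W2 @ H[i]) for i in range(N)])

# Normalize the strengths into attention weights and average the encoder states.
a = np.exp(v - v.max()); a /= a.sum()
context = a @ H                        # attention (context) vector fed to the decoder

print("attention weights:", np.round(a, 3))
print("context vector shape:", context.shape)
```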
5.5 Applications
In this section, we will introduce several applications of document-level analysis based on representation learning.
5.5.1 Neural Information Retrieval
Information retrieval aims to obtain relevant resources from a large-scale collection of information resources. As shown in Fig. 5.8, given the query “Steve Jobs” as input, the search engine (a typical application of information retrieval) provides relevant web pages for users. Traditional information retrieval data consists of search queries and document collections D, and the ground truth is available through explicit human judgments or implicit user behavior data such as click-through rate.
For a given query q and document d, traditional information retrieval models estimate their relevance through lexical matches. Neural information retrieval models pay more attention to capturing query-document relevance through semantic matches. Both lexical and semantic matches are essential for neural information retrieval. Benefiting from the representational power of neural networks, neural models can capture more sophisticated matching features and have achieved the state of the art in information retrieval tasks [17].
Current neural ranking models can be categorized into two groups: representation-based and interaction-based [23]. Earlier works mainly focus on representation-based models, which learn good representations of queries and documents and match them in the learned representation space. Interaction-based methods, on the other hand, model the query-document matches from the interactions of their terms.
5.5.1.1 Representation-Based Neural Ranking Models
The representation-based methods directly match the query and documents by learning two distributed representations, respectively, and then compute the matching score based on the similarity between them. In recent years, several deep neural models have been explored based on such a Siamese architecture, which can be implemented with feed-forward layers, convolutional neural networks, or recurrent neural networks.
Reference [26] proposes Deep Structured Semantic Models (DSSM), which first hash words into letter-trigram-based representations and then use a multilayer fully connected neural network to encode a query (or a document) as a vector. The relevance between the query and the document can be simply calculated with cosine similarity. Reference [26] trains the model by minimizing the cross-entropy loss on click-through data, where each training sample consists of a query q, a positive document \(d^+\), and a uniformly sampled negative document set \(D^-\):
where \(D=\{d^+\} \cup D^-\).
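The scoring and training objective can be sketched as follows (PyTorch; the letter-trigram hashing is replaced by a generic sparse term vector for brevity, and all sizes are illustrative assumptions): the query and documents are encoded by the same feed-forward tower, relevance is their cosine similarity, and the loss is the softmax cross-entropy of the positive document against the sampled negatives.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DSSMTower(nn.Module):
    """Shared encoder mapping a sparse term vector to a dense semantic vector."""
    def __init__(self, in_dim=30000, hid=300, out=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hid), nn.Tanh(),
                                 nn.Linear(hid, out), nn.Tanh())

    def forward(self, x):
        return self.net(x)

tower = DSSMTower()
gamma = 10.0                                  # smoothing factor on the cosine score

q = torch.rand(1, 30000)                      # query term vector
docs = torch.rand(5, 30000)                   # 1 clicked document + 4 sampled negatives
label = torch.tensor([0])                     # index of the positive document

# Cosine relevance between the query and every candidate document.
q_vec, d_vec = tower(q), tower(docs)
scores = F.cosine_similarity(q_vec.expand_as(d_vec), d_vec, dim=-1) * gamma

# Cross-entropy over the softmax of relevance scores (positive vs. negatives).
loss = F.cross_entropy(scores.unsqueeze(0), label)
loss.backward()
print(float(loss))
```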
Furthermore, C-DSSM [54] and ARC-I [25] utilize convolutional neural networks (CNN), while LSTM-RNN [44] adopts a recurrent neural network with Long Short-Term Memory (LSTM) units to better represent a sentence. Reference [53] also comes up with a more sophisticated similarity function by leveraging additional neural network layers.
5.5.1.2 Interaction-Based Neural Ranking Models
The interaction-based neural ranking models learn word-level interaction patterns from query-document pairs, as shown in Fig. 5.9. They provide an opportunity to compare different parts of the query with different parts of the document individually and to aggregate the partial evidence of relevance. ARC-II [25] and MatchPyramid [45] utilize convolutional neural networks to capture complicated patterns from word-level interactions. The Deep Relevance Matching Model (DRMM) uses pyramid pooling (histogram) to summarize the word-level similarities into ranking models [23]. There are also some works establishing position-dependent interactions for ranking models [27, 46].
The Kernel-based Neural Ranking Model (K-NRM) [66] and its convolutional version Conv-KNRM [17] achieve the state of the art in neural information retrieval. K-NRM first establishes a translation matrix \(\mathbf {M}\) in which each element \(\mathbf {M}_{ij}\) is the cosine similarity between the ith word in q and the jth word in d. Then K-NRM utilizes kernels to convert the translation matrix \(\mathbf {M}\) into ranking features \(\phi (\mathbf {M})\):
Each RBF kernel \(K_k\) calculates how word pair similarities are distributed:
Then the relevance of q and d is calculated by a ranking layer:
where \(\mathbf {w}\) and b are trainable parameters.
Reference [66] trains the model by minimizing pairwise loss on clickthrough data:
For the given query q, \(D^{+,-}\) contains the pairwise preferences from the ground truth: \(d^+\) and \(d^-\) are two documents such that \(d^+\) is more relevant to q than \(d^-\). Conv-KNRM extends K-NRM to model n-gram semantic matches based on convolutional neural networks, which can leverage snippet information.
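Kernel pooling itself takes only a few lines. The sketch below (NumPy; the random embeddings, kernel means, and kernel widths are illustrative assumptions) builds the translation matrix \(\mathbf{M}\), applies the RBF kernels, and log-sums the kernel values into the ranking features \(\phi(\mathbf{M})\) and a final score.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_q, n_d = 50, 4, 30

# Word embeddings of the query and the document (normalized so that the
# dot product equals cosine similarity).
def normalize(x): return x / np.linalg.norm(x, axis=-1, keepdims=True)
Q = normalize(rng.normal(size=(n_q, dim)))
D = normalize(rng.normal(size=(n_d, dim)))

M = Q @ D.T                                        # translation matrix M_ij = cos(q_i, d_j)

mu = np.array([-0.7, -0.3, 0.1, 0.5, 0.9, 1.0])    # kernel means (illustrative)
sigma = np.array([0.1, 0.1, 0.1, 0.1, 0.1, 0.001]) # kernel widths; the last acts as an exact-match kernel

# Each RBF kernel soft-counts how many document words fall near mu_k for each query word;
# the soft counts are then log-summed over query words into ranking features phi(M).
kernel_vals = np.exp(-(M[:, :, None] - mu) ** 2 / (2 * sigma ** 2))   # (n_q, n_d, K)
soft_tf = kernel_vals.sum(axis=1)                                     # (n_q, K)
phi = np.log(np.clip(soft_tf, 1e-10, None)).sum(axis=0)               # (K,)

w, b = rng.normal(size=len(mu)), 0.0
score = np.tanh(w @ phi + b)                       # final ranking score f(q, d)
print("phi(M):", np.round(phi, 2), "score:", round(float(score), 3))
```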
5.5.1.3 Summary
Representation-based models and interaction-based models extract matching features from overall and local aspects, respectively. They can also be combined for further improvements [40].
Recently, large-scale knowledge graphs such as DBpedia, Yago, and Freebase have emerged. Knowledge graphs contain human knowledge about real-world entities and become an opportunity for search systems to better understand queries and documents. The emergence of large-scale knowledge graphs has motivated the development of entity-oriented search, which brings in entities and semantics from the knowledge graphs and has dramatically improved the effectiveness of feature-based search systems.
Entity-oriented search and neural ranking models push the boundary of matching from two different perspectives. Reference [36] incorporates semantics from knowledge graphs, such as entity descriptions and entity types, into neural ranking. This work significantly improves the effectiveness and generalization ability of interaction-based neural ranking models. However, how to fully leverage semi-structured knowledge graphs and establish semantic relevance between queries and documents remains an open question.
Information retrieval has been widely used in many natural language processing tasks such as reading comprehension and question answering. Therefore, there is no doubt that neural information retrieval will lead to new trends in these tasks.
5.5.2 Question Answering
Question Answering (QA) is one of the most important tasks in NLP and a typical document-level application. Many efforts have been invested in QA, especially in machine reading comprehension and open-domain QA. In this section, we will introduce the advances in these two tasks, respectively.
5.5.2.1 Machine Reading Comprehension
As shown in Fig. 5.10, machine reading comprehension aims to determine the answer a to the question q given a passage p. The task could be viewed as a supervised learning problem: given a collection of training examples \(\{(p_i, q_i, a_i)\}_{i=1}^n\), we want to learn a mapping \(f(\cdot )\) that takes the passage \(p_i\) and corresponding question \(q_i\) as inputs and outputs \(\hat{a}_i\), where \(evaluate(\hat{a}_i,a_i)\) is maximized. The evaluation metric is typically correlated with the answer type, which will be discussed in the following.
Generally, the current machine reading comprehension task could be divided into four categories depending on the answer types according to [10], i.e., cloze style, multiple choice, span prediction, and free-form answer.
The cloze style task, such as CNN/Daily Mail [24], consists of fill-in-the-blank sentences where the question contains a placeholder to be filled in. The answer a is either chosen from a predefined candidate set A or from the vocabulary V. The multiple-choice task, such as RACE [30] and MCTest [50], aims to select the best answer from a set of answer choices. It is typical to use accuracy to measure the performance on these two tasks, i.e., the percentage of correctly answered questions in the whole example set, since each question is either correctly answered or not given the hypothesized answer set.
The span prediction task, such as SQuAD [49], is perhaps the most widely adopted task among all, since it strikes a compromise between flexibility and simplicity. The task is to extract the most likely text span from the passage as the answer to the question, which is usually modeled as predicting the start position \(idx_{start}\) and end position \(idx_{end}\) of the answer span. To evaluate the predicted answer span \(\hat{a}\), we typically use two evaluation metrics proposed by [49]. Exact match assigns full score 1.0 to the predicted answer span \(\hat{a}\) if it exactly equals the ground truth answer a, and otherwise 0.0. The F1 score measures the degree of overlap between \(\hat{a}\) and a by computing a harmonic mean of the precision and recall.
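Both metrics are easy to state precisely in code. The sketch below (plain Python; the simple whitespace-and-lowercase normalization is an assumption, whereas the official SQuAD script also strips articles and punctuation) computes exact match and token-level F1 between a predicted span and a gold answer.

```python
from collections import Counter

def normalize(text):
    # Minimal normalization; official evaluation scripts additionally
    # remove articles and punctuation.
    return text.lower().split()

def exact_match(prediction, ground_truth):
    return float(normalize(prediction) == normalize(ground_truth))

def f1_score(prediction, ground_truth):
    pred_tokens, gold_tokens = normalize(prediction), normalize(ground_truth)
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("the Eiffel Tower", "The Eiffel Tower"))               # 1.0
print(round(f1_score("Eiffel Tower in Paris", "the Eiffel Tower"), 3))   # partial credit
```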
The free-form answer task, such as MS MARCO [43], does not restrict the answer form or length and is also referred to as generative question answering. It is practical to model the task as a sequence generation problem, where discrete token-level predictions are made. Currently, a consensus on the ideal evaluation metric has not been reached. It is common to adopt standard metrics from machine translation and summarization, including ROUGE [34] and BLEU [57].
As a critical component of the question answering system, the surging neural-based machine reading comprehension models have greatly boosted the task of question answering in recent years.
The first attempt [24] to apply neural networks on machine reading comprehension constructs bidirectional LSTM reader models along with attention mechanisms. The work introduces two reader models, i.e., the attentive reader and the impatient reader, as shown in Fig. 5.11. After encoding the passage and the query into hidden states using LSTMs, the attentive reader computes a scalar distribution s(t) over the passage tokens and uses it to compute the weighted sum of the passage hidden states r. The impatient reader extends this idea further by recurrently updating the weighted sum of passage hidden states after it has seen each query token.
The attention mechanisms used in reading comprehension could be viewed as a variant of Memory Networks [64]. Memory Networks use long-term memory units to store information for inference dynamically. Typically, given an input x, the model first converts it into an internal feature representation F(x). Then, the model can update the designated memory units \(m_i\) given the new input: \(m_i=g(m_i, F(x), m)\), or generate output features o given the new input and the memory states: \(o=f(F(x), m)\). Finally, the model converts the output into the response with the desired format: \(r=R(o)\). The key takeaway of Memory Networks is the retaining and updating of some internal memories that capture global information. We will see how this idea is further extended in some sophisticated models.
There is no doubt that the application of attention to machine reading comprehension greatly promotes research in this field. Following [24], the work in [11] modifies the method of computing attention and simplifies the prediction layer of the attentive reader. Instead of using \(\tanh(\cdot )\) to compute the relevance between the passage representations \(\{\tilde{\mathbf {p}_i}\}_{i=1}^n\) and the query hidden state \(\mathbf {q}\) (see Eq. 5.33), Chen et al. use bilinear terms to directly capture the passage-query alignment (see Eq. 5.34).
Most machine reading comprehension models follow the same paradigm to locate the start and end points of the answer span. As shown in Fig. 5.12, while encoding the passage, the model retains the length of the sequence and encodes the question into a fixed-length hidden representation \(\mathbf {q}\). The question's hidden vector is then used as a pointer to scan over the passage representations \(\{\mathbf {p}_i\}_{i=1}^n\) and compute scores at every position in the passage. While maintaining this similar architecture, most machine reading comprehension models vary in the interaction methods between the passage and the question. In the following, we will introduce several classic reading comprehension architectures that follow this paradigm.
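The span-prediction paradigm can be sketched in a few lines (PyTorch; an illustrative skeleton with assumed dimensions, in the spirit of the bilinear scoring mentioned above rather than any specific published model): the fixed-length question vector \(\mathbf{q}\) scores every passage position through two bilinear terms, giving separate start and end distributions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

hidden, n_tokens = 128, 40
P = torch.rand(1, n_tokens, hidden)        # contextual passage representations {p_i}
q = torch.rand(1, hidden)                  # fixed-length question representation

# Two bilinear scoring matrices, one for the start position and one for the end.
W_start = nn.Parameter(torch.randn(hidden, hidden) * 0.01)
W_end = nn.Parameter(torch.randn(hidden, hidden) * 0.01)

# score_start(i) = p_i^T W_start q  (and analogously for the end position).
start_logits = torch.einsum("bth,hk,bk->bt", P, W_start, q)
end_logits = torch.einsum("bth,hk,bk->bt", P, W_end, q)

p_start = F.softmax(start_logits, dim=-1)
p_end = F.softmax(end_logits, dim=-1)

# Greedy decoding: most likely start, then the most likely end at or after it.
start = int(p_start.argmax())
end = start + int(p_end[0, start:].argmax())
print("predicted span:", (start, end))
```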
First, we introduce BiDAF, which is short for Bi-Directional Attention Flow [52]. The BiDAF network consists of the token embedding layer, the contextual embedding layer, the bidirectional attention flow layer, the LSTM modeling layer, and the softmax output layer, as shown in Fig. 5.13.
The token embedding layer consists of two levels. First, the character embedding layer encodes each word at the character level by adopting a 1D convolutional neural network (CNN). Specifically, for each word, characters are embedded into fixed-length vectors, which are considered as 1D inputs for the CNN. The outputs are then max-pooled along the embedding dimension to obtain a single fixed-length vector. Second, the word embedding layer uses pretrained word vectors, i.e., GloVe [47], to map each word into a high-dimensional vector directly.
Then the concatenation of the two vectors is fed into a twolayer Highway Network [56]. Equation 5.35 shows one layer of the highway network used in the paper, where \(H_1(\cdot )\) and \(H_2(\cdot )\) represent two affine transformations:
After feeding the context and the query into the token embedding layer, we obtain \(\mathbf {X}\in \mathbb {R}^{d\times T}\) for the context and \(\mathbf {Q}\in \mathbb {R}^{d\times J}\) for the query, respectively. Afterward, the contextual embedding layer, which is a bidirectional LSTM, models the temporal interactions between words for both the context and the query.
Then comes the attention flow layer. In this layer, the attention dependency is computed in both directions, i.e., the context-to-query (C2Q) attention and the query-to-context (Q2C) attention. For both kinds of attention, we first compute a similarity matrix \(\mathbf {S}\in \mathbb {R}^{T\times J}\) using the contextual embeddings of the context \(\mathbf {H}\) and the query \(\mathbf {U}\) obtained from the last layer (Eq. 5.37). In the equation, \(\alpha (\cdot )\) computes the scalar similarity of the given two vectors, taking the form \(\alpha (\mathbf {h}, \mathbf {u}) = \mathbf {m}^\top [\mathbf {h}; \mathbf {u}; \mathbf {h}\odot \mathbf {u}]\), and \(\mathbf {m}\) is a trainable weight vector,
where \(\odot \) indicates the element-wise product.
For the C2Q attention, a weighted sum of contextual query embeddings is computed for each context word. The attention distribution over the query is obtained by \(\mathbf {a}_t={\text {Softmax}}(\mathbf {S}_{t,:})\in \mathbb {R}^{J}\). The final attended query vector is therefore \(\tilde{\mathbf {U}}_{:,t}=\sum _j a_{tj}\mathbf {U}_{:,j}\) for each context word.
For the Q2C attention, the context embeddings are merged into a single fixed-length hidden vector \(\tilde{\mathbf {h}}\). The attention distribution over the context is computed by \(\mathbf {b}={\text {Softmax}}(\max _j \mathbf {S}_{tj})\), and \(\tilde{\mathbf {h}}=\sum _t b_t\mathbf {H}_{:,t}\). Lastly, the merged context embedding is tiled T times along the column to produce \(\tilde{\mathbf {H}}\).
Finally, the attended outputs are combined to yield \(\mathbf {G}\), which is defined by Eq. 5.39
Afterward, the LSTM modeling layer takes \(\mathbf {G}\) as input and encodes it using a twolayer bidirectional LSTM. The output \(\mathbf {M}\in \mathbb {R}^{2d\times T}\) is combined with \(\mathbf {G}\) to yield the final start and end probability distributions over the passage.
where \(\mathbf {u}_1, \ \mathbf {u}_2\) are two trainable weight vectors.
To train the model, the negative log-likelihood loss is adopted, and the goal is to maximize the probability of the golden start index \(idx_{start}\) and end index \(idx_{end}\) being selected by the model.
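The core of the attention flow layer can be written compactly. The sketch below (NumPy; the dimensions, initialization, and concrete similarity function are illustrative assumptions following the description above) computes the similarity matrix \(\mathbf{S}\), the C2Q attended query \(\tilde{\mathbf{U}}\), the Q2C attended context \(\tilde{\mathbf{H}}\), and the combined representation \(\mathbf{G}\).

```python
import numpy as np

rng = np.random.default_rng(0)
d2, T, J = 2 * 100, 30, 8              # 2d (BiLSTM output size), context length, query length

H = rng.normal(size=(d2, T))           # contextual context embeddings
U = rng.normal(size=(d2, J))           # contextual query embeddings
m = rng.normal(size=3 * d2)            # trainable weight m of the similarity function alpha

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Similarity matrix: S_tj = m^T [H_:t ; U_:j ; H_:t * U_:j].
S = np.empty((T, J))
for t in range(T):
    for j in range(J):
        S[t, j] = m @ np.concatenate([H[:, t], U[:, j], H[:, t] * U[:, j]])

# Context-to-query (C2Q): attend over query words for every context word.
A = softmax(S, axis=1)                 # (T, J)
U_tilde = U @ A.T                      # (2d, T)

# Query-to-context (Q2C): attend over context words with max-pooled similarities,
# then tile the single attended vector over all T positions.
b = softmax(S.max(axis=1), axis=0)     # (T,)
h_tilde = (H * b).sum(axis=1, keepdims=True)
H_tilde = np.tile(h_tilde, (1, T))     # (2d, T)

# Combine: G_:t = [H_:t ; U~_:t ; H_:t * U~_:t ; H_:t * H~_:t].
G = np.concatenate([H, U_tilde, H * U_tilde, H * H_tilde], axis=0)
print(G.shape)                         # (8d, T) = (800, 30)
```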
Besides BiDAF, where attention dependencies are computed in two directions, we will also briefly introduce other interaction methods between the query and the passage. The Gated-Attention Reader proposed by [19] adopts the gated attention module, where each token representation of the passage \(d_i\) is scaled by the attended query vector \(\mathbf {Q}\) after each BiGRU layer (Eq. 5.44).
This gated attention mechanism allows the query to directly interact with the token embeddings of the passage at the semantic level. And such layerwise interaction enables the model to learn conditional token representation given the question at different representation levels.
The Attention-over-Attention Reader [16] takes another path to model the interaction. The attention-over-attention mechanism involves calculating the attention between the passage attention \(\alpha (t)\) and the averaged question attention \(\beta \) after obtaining the similarity matrix \(\mathbf {M}\in \mathbb {R}^{n\times m}\) (Eq. 5.47). This operation is considered to learn the contributions of individual question words explicitly.
5.5.2.2 Open-Domain Question Answering
Open-domain QA (OpenQA) was first proposed by [21]. The task aims to answer open-domain questions using external resources such as collections of documents [58], web pages [14, 29], structured knowledge graphs [3, 7], or automatically extracted relational triples [20].
Recently, with the development of machine reading comprehension techniques [11, 16, 19, 55, 63], researchers have attempted to answer open-domain questions via performing reading comprehension on plain texts. Reference [12] proposes to employ neural-based models to answer open-domain questions. As illustrated in Fig. 5.14, a neural-based OpenQA system usually retrieves relevant texts of the question from a large-scale corpus and then extracts answers from these texts using reading comprehension models.
The DrQA system consists of two components: (1) The document retriever module for finding relevant articles and (2) the document reader model for extracting answers from given contexts.
The document retriever is used as a first quick skim to narrow the search space and focus on documents that are likely to be relevant. The retriever builds TF-IDF weighted bag-of-words vectors for the documents and the questions and computes similarity scores for ranking. To further utilize local word order information, the retriever uses bigram counts with feature hashing, which preserves both speed and memory efficiency.
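A simplified version of such a retriever is sketched below (plain Python; the hashing bucket size and toy corpus are assumptions, and DrQA's actual implementation uses much larger sparse matrices): unigrams and bigrams are hashed into a fixed number of buckets, TF-IDF vectors are built for each document, and candidate documents are ranked by inner-product similarity with the query.

```python
import math
from collections import Counter

N_BUCKETS = 2 ** 20          # hashed feature space (illustrative size)

def ngrams(text):
    toks = text.lower().split()
    return toks + [" ".join(b) for b in zip(toks, toks[1:])]     # unigrams + bigrams

def hashed_counts(text):
    return Counter(hash(g) % N_BUCKETS for g in ngrams(text))

docs = [
    "Steve Jobs co-founded Apple in 1976",
    "The Eiffel Tower is located in Paris",
    "Apple released the first iPhone in 2007",
]
doc_tf = [hashed_counts(d) for d in docs]

# Inverse document frequency over the hashed n-gram buckets.
df = Counter(b for tf in doc_tf for b in tf)
idf = {b: math.log(len(docs) / df[b]) for b in df}

def tfidf(tf):
    return {b: c * idf.get(b, 0.0) for b, c in tf.items()}

doc_vecs = [tfidf(tf) for tf in doc_tf]

def score(query_vec, doc_vec):
    # Sparse inner product between two TF-IDF vectors.
    return sum(w * doc_vec.get(b, 0.0) for b, w in query_vec.items())

query_vec = tfidf(hashed_counts("who founded Apple"))
ranking = sorted(range(len(docs)), key=lambda i: -score(query_vec, doc_vecs[i]))
print("best document:", docs[ranking[0]])
```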
The document reader model takes in the top 5 Wikipedia articles yielded by the document retriever and extracts the final answer to the question. For each article, the document reader predicts an answer span with a confidence score. The final prediction is made by maximizing the unnormalized exponential of prediction scores across the documents.
Given each document d, the document reader first builds a feature representation \(\tilde{\mathbf {d}}_i\) for each word in the document. The feature representation is made up of the following components.

1. Word embeddings: The word embeddings \(f_{emb}({d})\) are obtained from large-scale GloVe embeddings pretrained on Wikipedia.
2. Manual features: The manual features \(f_{token}({d})\) combine part-of-speech (POS) tags, named entity recognition (NER) tags, and normalized term frequency (TF).
3. Exact match: This feature indicates whether \(d_i\) can be exactly matched to one question word in q.
4. Aligned question embeddings: This feature aims to encode a soft alignment between words in the document and the question in the word embedding space:
$$\begin{aligned} f_{align}(d_i) = \sum _j\alpha _{ij}\mathbf {E}(q_j) , \end{aligned}$$(5.48)
$$\begin{aligned} \alpha _{ij}=\frac{\exp ({\text {MLP}}(\mathbf {E}(d_i))^\top {\text {MLP}}(\mathbf {E}(q_j)))}{\sum _{j'}\exp ({\text {MLP}}(\mathbf {E}(d_i))^\top {\text {MLP}}(\mathbf {E}(q_{j'})))} , \end{aligned}$$(5.49)
where \({\text {MLP}}(\mathbf {x})=\max (0, \mathbf {W}\mathbf {x}+\mathbf {b})\) and \(\mathbf {E}(q_j)\) indicates the word embedding of the jth word in the question.
Finally, the feature representation is obtained by concatenating the above features:
Then the feature representation of the document is fed into a multilayer bidirectional LSTM (BiLSTM) to encode the contextual representation.
For the question, the contextual representation is simply obtained by encoding the word embeddings using a multilayer BiLSTM.
After that, the contextual representation is aggregated into a fixedlength vector using selfattention.
In the answer prediction phase, the start and end probability distributions are calculated following the paradigm mentioned in the Reading Comprehension Model section (Sect. 5.5.2.1).
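Among these features, the aligned question embedding is the least obvious one; the sketch below (NumPy; the embedding size, the single-layer MLP, and random initialization are illustrative assumptions) computes the soft alignment of Eqs. 5.48–5.49 for every document word.

```python
import numpy as np

rng = np.random.default_rng(0)
emb_dim, hid, n_doc, n_q = 50, 64, 12, 6

E_d = rng.normal(size=(n_doc, emb_dim))      # word embeddings E(d_i) of document words
E_q = rng.normal(size=(n_q, emb_dim))        # word embeddings E(q_j) of question words
W, b = rng.normal(size=(hid, emb_dim)) * 0.1, np.zeros(hid)

def mlp(x):
    # MLP(x) = max(0, Wx + b), applied row-wise.
    return np.maximum(0.0, x @ W.T + b)

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# alpha_ij proportional to exp(MLP(E(d_i))^T MLP(E(q_j)))   (Eq. 5.49)
scores = mlp(E_d) @ mlp(E_q).T               # (n_doc, n_q)
alpha = softmax(scores, axis=1)

# f_align(d_i) = sum_j alpha_ij E(q_j)                       (Eq. 5.48)
f_align = alpha @ E_q                        # (n_doc, emb_dim)
print(f_align.shape)                         # (12, 50)
```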
Despite its success, the DrQA system is prone to noise in retrieved texts, which may hurt the performance of the system. Hence, [15] and [61] attempt to alleviate the noise problem in DrQA by separating question answering into paragraph selection and answer extraction, where only the most relevant paragraph among all retrieved paragraphs is selected to extract answers. However, this discards a large amount of rich information contained in the neglected paragraphs. Hence, [62] proposes strength-based and coverage-based re-ranking approaches, which can aggregate the results extracted from each paragraph by an existing DS-QA system to better determine the answer. However, the method relies on the pre-extracted answers of existing DS-QA models and still suffers from the noise issue in distant supervision data because it considers all retrieved paragraphs indiscriminately. To address this issue, [35] proposes a coarse-to-fine denoising OpenQA model, which employs a paragraph selector to filter out noisy paragraphs and a paragraph reader to extract the correct answer from those denoised paragraphs.
5.6 Summary
In this chapter, we have introduced document representation learning, which encodes the semantic information of the whole document into a real-valued representation vector, providing an effective way for downstream tasks to utilize document information and significantly improving the performance of these tasks.
First, we introduce the one-hot representation for documents. Next, we extensively introduce topic models that represent both words and documents using latent topic distributions. Further, we give an introduction to distributed document representation, including paragraph vector and neural document representations. Finally, we introduce several typical real-world applications of document representation, including information retrieval and question answering.
In the future, for better document representation, some directions are requiring further efforts:

(1)
Incorporating External Knowledge. Current document representation approaches focus on representing documents with the semantic information of the whole document text. Moreover, knowledge bases provide external semantic information to better understand the realworld entities in the given document. Researchers have formed a consensus that incorporating entity semantics of knowledge bases into document representation is a potential way toward better document representation. Some existing work leverages various entity semantics to enhance the semantic information of document representation and achieves better performance in multiple applications such as document ranking [36, 65]. Explicitly modeling structural and textual semantic information as well as considering the entity importance for the given document also share some lights for a more interpretable and knowledgable document representation for downstream NLP tasks.

(2) Considering Document Interactions. The candidate documents in downstream NLP tasks are usually relevant to each other, which may help to better model document semantic information. There is no doubt that the interactions among documents, whether through implicit semantic relations or explicit links, provide additional semantic signals to enhance document representations. Reference [32] preliminarily uses document interactions to extract important words and improve model performance. Nevertheless, how to effectively and explicitly incorporate semantic information from other documents into document representations remains an unsolved problem.

(3) Pre-training for Document Representation. Pre-training has proven effective and thrives on downstream NLP tasks. Existing pre-trained language models, such as Word2vec-style word co-occurrence models [38] and BERT-style masked language models [18, 48], focus on representation learning at the sentence level and cannot work well for document-level representation. It is still challenging to model cross-sentence relations, text coherence, and coreference at the document level in document representation learning. Moreover, some methods leverage useful signals such as anchor-document information to supervise document representation learning [67]. How to pre-train document representation models with efficient and effective strategies is still a critical and challenging problem.
References
Amr Ahmed, Mohamed Aly, Joseph Gonzalez, Shravan Narayanamurthy, and Alexander Smola. Scalable inference in latent variable models. In Proceedings of WSDM, 2012.
Arthur Asuncion, Max Welling, Padhraic Smyth, and Yee Whye Teh. On smoothing and inference for topic models. In Proceedings of UAI, 2009.
Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. Semantic parsing on Freebase from question-answer pairs. In Proceedings of EMNLP, 2013.
David M Blei, Thomas L Griffiths, and Michael I Jordan. The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies. Journal of the ACM, 57(2):7, 2010.
David M Blei and John D Lafferty. Dynamic topic models. In Proceedings of ICML, 2006.
David M Blei, Andrew Y Ng, and Michael I Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
Antoine Bordes, Nicolas Usunier, Sumit Chopra, and Jason Weston. Large-scale simple question answering with memory networks. arXiv preprint arXiv:1506.02075, 2015.
Jordan L Boyd-Graber and David M Blei. Syntactic topic models. In Proceedings of NeurIPS, 2009.
Jonathan Chang and David M Blei. Hierarchical relational models for document networks. The Annals of Applied Statistics, pages 124–150, 2010.
Danqi Chen. Neural Reading Comprehension and Beyond. PhD thesis, Stanford University, 2018.
Danqi Chen, Jason Bolton, and Christopher D. Manning. A thorough examination of the CNN/Daily Mail reading comprehension task. In Proceedings of ACL, 2016.
Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. Reading Wikipedia to answer open-domain questions. In Proceedings of ACL, 2017.
Jianfei Chen, Kaiwei Li, Jun Zhu, and Wenguang Chen. WarpLDA: A cache efficient O(1) algorithm for latent Dirichlet allocation. In Proceedings of VLDB, 2016.
Tongfei Chen and Benjamin Van Durme. Discriminative information retrieval for question answering sentence selection. In Proceedings of EACL, 2017.
Eunsol Choi, Daniel Hewlett, Jakob Uszkoreit, Illia Polosukhin, Alexandre Lacoste, and Jonathan Berant. Coarse-to-fine question answering for long documents. In Proceedings of ACL, 2017.
Yiming Cui, Zhipeng Chen, Si Wei, Shijin Wang, Ting Liu, and Guoping Hu. Attention-over-attention neural networks for reading comprehension. In Proceedings of ACL, 2017.
Zhuyun Dai, Chenyan Xiong, Jamie Callan, and Zhiyuan Liu. Convolutional neural networks for soft-matching n-grams in ad-hoc search. In Proceedings of WSDM, 2018.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL, 2019.
Bhuwan Dhingra, Hanxiao Liu, Zhilin Yang, William Cohen, and Ruslan Salakhutdinov. Gated-attention readers for text comprehension. In Proceedings of ACL, 2017.
Anthony Fader, Luke Zettlemoyer, and Oren Etzioni. Open question answering over curated and extracted knowledge bases. In Proceedings of SIGKDD, 2014.
Bert F Green Jr, Alice K Wolf, Carol Chomsky, and Kenneth Laughery. Baseball: An automatic question-answerer. In Proceedings of IRE-AIEE-ACM, 1961.
Thomas L Griffiths, Mark Steyvers, David M Blei, and Joshua B Tenenbaum. Integrating topics and syntax. In Proceedings of NeurIPS, 2004.
Jiafeng Guo, Yixing Fan, Qingyao Ai, and W. Bruce Croft. A deep relevance matching model for ad-hoc retrieval. In Proceedings of CIKM, 2016.
Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. Teaching machines to read and comprehend. In Proceedings of NeurIPS, 2015.
Baotian Hu, Zhengdong Lu, Hang Li, and Qingcai Chen. Convolutional neural network architectures for matching natural language sentences. In Proceedings of NeurIPS, 2014.
Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. Learning deep structured semantic models for web search using clickthrough data. In Proceedings of CIKM, 2013.
Kai Hui, Andrew Yates, Klaus Berberich, and Gerard de Melo. PACRR: A position-aware neural IR model for relevance matching. In Proceedings of EMNLP, 2017.
Yangfeng Ji, Trevor Cohn, Lingpeng Kong, Chris Dyer, and Jacob Eisenstein. Document context language models. arXiv preprint arXiv:1511.03962, 2015.
Cody Kwok, Oren Etzioni, and Daniel S Weld. Scaling question answering to the web. TOIS, pages 242–262, 2001.
Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. RACE: Large-scale reading comprehension dataset from examinations. arXiv preprint arXiv:1704.04683, 2017.
Quoc V Le and Tomas Mikolov. Distributed representations of sentences and documents. In Proceedings of ICML, 2014.
Canjia Li, Yingfei Sun, Ben He, Le Wang, Kai Hui, Andrew Yates, Le Sun, and Jungang Xu. NPRF: A neural pseudo relevance feedback framework for ad-hoc information retrieval. In Proceedings of EMNLP, 2018.
Jiwei Li, Thang Luong, and Dan Jurafsky. A hierarchical neural autoencoder for paragraphs and documents. In Proceedings of ACL, 2015.
Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, 2004.
Yankai Lin, Haozhe Ji, Zhiyuan Liu, and Maosong Sun. Denoising distantly supervised open-domain question answering. In Proceedings of ACL, 2018.
Zhenghao Liu, Chenyan Xiong, Maosong Sun, and Zhiyuan Liu. Entity-Duet neural ranking: Understanding the role of knowledge graph semantics in neural information retrieval. In Proceedings of ACL, 2018.
Jon D Mcauliffe and David M Blei. Supervised topic models. In Proceedings of NeurIPS, 2008.
T Mikolov and J Dean. Distributed representations of words and phrases and their compositionality. In Proceedings of NeurIPS, 2013.
David Mimno and Andrew McCallum. Topic models conditioned on arbitrary features with Dirichlet-multinomial regression. In Proceedings of UAI, 2008.
Bhaskar Mitra, Fernando Diaz, and Nick Craswell. Learning to match using local and distributed representations of text for web search. In Proceedings of WWW, 2017.
David Newman, Arthur U Asuncion, Padhraic Smyth, and Max Welling. Distributed inference for latent Dirichlet allocation. In Proceedings of NeurIPS, 2007.
David Newman, Chaitanya Chemudugunta, and Padhraic Smyth. Statistical entity-topic models. In Proceedings of SIGKDD, 2006.
Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. MS MARCO: A human generated machine reading comprehension dataset. arXiv preprint arXiv:1611.09268, 2016.
Hamid Palangi, Li Deng, Yelong Shen, Jianfeng Gao, Xiaodong He, Jianshu Chen, Xinying Song, and Rabab Ward. Deep sentence embedding using long short-term memory networks: Analysis and application to information retrieval. IEEE/ACM Transactions on Audio, Speech and Language Processing, 24(4):694–707, 2016.
Liang Pang, Yanyan Lan, Jiafeng Guo, Jun Xu, Shengxian Wan, and Xueqi Cheng. Text matching as image recognition. In Proceedings of AAAI, 2016.
Liang Pang, Yanyan Lan, Jiafeng Guo, Jun Xu, Jingfang Xu, and Xueqi Cheng. DeepRank: A new deep architecture for relevance ranking in information retrieval. In Proceedings of CIKM, 2017.
Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In Proceedings of EMNLP, 2014.
Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In Proceedings of NAACL-HLT, pages 2227–2237, 2018.
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of EMNLP, 2016.
Matthew Richardson, Christopher JC Burges, and Erin Renshaw. MCTest: A challenge dataset for the open-domain machine comprehension of text. In Proceedings of EMNLP, 2013.
Michal Rosen-Zvi, Thomas Griffiths, Mark Steyvers, and Padhraic Smyth. The author-topic model for authors and documents. In Proceedings of UAI, 2004.
Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. Bidirectional attention flow for machine comprehension. In Proceedings of ICLR, 2017.
Aliaksei Severyn and Alessandro Moschitti. Learning to rank short text pairs with convolutional deep neural networks. In Proceedings of SIGIR, 2015.
Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Grégoire Mesnil. A latent semantic model with convolutional-pooling structure for information retrieval. In Proceedings of CIKM, 2014.
Yelong Shen, Po-Sen Huang, Jianfeng Gao, and Weizhu Chen. ReasoNet: Learning to stop reading in machine comprehension. In Proceedings of SIGKDD, 2017.
Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. Highway networks. arXiv preprint arXiv:1505.00387, 2015.
A Cuneyd Tantug, Kemal Oflazer, and Ilknur Durgar El-Kahlout. BLEU+: A tool for fine-grained BLEU computation. 2008.
Ellen M Voorhees et al. The TREC-8 question answering track report. In Proceedings of TREC, 1999.
Hanna M Wallach. Topic modeling: Beyond bag-of-words. In Proceedings of ICML, 2006.
Chong Wang, Bo Thiesson, Chris Meek, and David Blei. Markov topic models. In Proceedings of AISTATS, 2009.
Shuohang Wang, Mo Yu, Xiaoxiao Guo, Zhiguo Wang, Tim Klinger, Wei Zhang, Shiyu Chang, Gerald Tesauro, Bowen Zhou, and Jing Jiang. R3: Reinforced ranker-reader for open-domain question answering. In Proceedings of AAAI, 2018.
Shuohang Wang, Mo Yu, Jing Jiang, Wei Zhang, Xiaoxiao Guo, Shiyu Chang, Zhiguo Wang, Tim Klinger, Gerald Tesauro, and Murray Campbell. Evidence aggregation for answer re-ranking in open-domain question answering. In Proceedings of ICLR, 2018.
Wenhui Wang, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. Gated self-matching networks for reading comprehension and question answering. In Proceedings of ACL, 2017.
Jason Weston, Sumit Chopra, and Antoine Bordes. Memory networks. arXiv preprint arXiv:1410.3916, 2014.
Chenyan Xiong, Jamie Callan, and Tie-Yan Liu. Word-entity duet representations for document ranking. In Proceedings of SIGIR, 2017.
Chenyan Xiong, Zhuyun Dai, Jamie Callan, Zhiyuan Liu, and Russell Power. End-to-end neural ad-hoc ranking with kernel pooling. In Proceedings of SIGIR, 2017.
Kaitao Zhang, Chenyan Xiong, Zhenghao Liu, and Zhiyuan Liu. Selective weak supervision for neural information retrieval. arXiv preprint arXiv:2001.10382, 2020.
Rights and permissions
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Copyright information
© 2020 The Author(s)