
Document Representation


A document is usually the highest linguistic unit of natural language. Document representation aims to encode the semantic information of the whole document into a real-valued representation vector, which could be further utilized in downstream tasks. Recently, document representation has become an essential task in natural language processing and has been widely used in many document-level real-world applications such as information retrieval and question answering. In this chapter, we first introduce the one-hot representation for documents. Next, we extensively introduce topic models that learn the topic distribution of words and documents. Further, we give an introduction to distributed document representation, including paragraph vector and neural document representations. Finally, we introduce several typical real-world applications of document representation, including information retrieval and question answering.

5.1 Introduction

Advances in information and communication technologies offer ubiquitous access to vast amounts of information and are causing an exponential increase in the number of documents available online. While more and more textual information is available electronically, effective retrieval and mining are becoming increasingly difficult without efficient organization, summarization, and indexing of document content. Therefore, document representation plays an important role in many real-world applications, e.g., document retrieval, web search, and spam filtering. Document representation aims to encode an input document as a fixed-length vector that describes its contents, reducing the complexity of documents and making them easier to handle. Traditional document representation models such as one-hot document representation have achieved promising results in many document classification and clustering tasks due to their simplicity, efficiency, and often surprising accuracy.

However, the one-hot document representation model has many disadvantages. First, it loses the word order, and thus, different documents can have the same representation, as long as the same words are used. Second, it usually suffers from data sparsity and high dimensionality. Third, it captures little about the semantics of the words or, more formally, the distances between the words. Hence, an alternative approach represents text documents using multi-word terms as vector components, where the terms are noun phrases extracted using a combination of linguistic and statistical criteria. This representation is motivated by the notion, shared with topic models, that terms should contain more semantic information than individual words. Another advantage of using terms to represent a document is the lower dimensionality compared with the traditional one-hot document representation.

Nevertheless, applying these to generation tasks remains difficult. To understand how discourse units are connected, one has to understand the communicative function of each unit, and the role it plays within the context that encapsulates it, recursively all the way up for the entire text. Identifying increasingly sophisticated human-developed features may be insufficient for capturing these patterns, but developing representation-based alternatives has also been difficult. Although document representation can capture aspects of coherent sentence structure, it is not clear how it could help in generating more broadly cohesive text.

Recently, neural network models have shown compelling results in generating meaningful and grammatical documents in sequence generation tasks like machine translation or parsing. This is partially attributed to the ability of these systems to capture local compositionality: the way neighboring words are combined semantically and syntactically to form the meanings they express. Based on neural network models, many research works have developed a variety of ways to incorporate document-level contextual information. These models are all hybrid architectures in that they are recurrent at the sentence level, but use a different structure to summarize the context outside the sentence. Furthermore, some models explore multilevel recurrent architectures for combining local and global information in language modeling.

In this chapter, we first introduce the one-hot representation for documents. Next, we extensively introduce topic models that aim to learn latent topic distributions of words and documents. Further, we give an introduction on distributed document representation including paragraph vector and neural document representations. Finally, we introduce several typical real-world applications of document representations, including information retrieval and question answering.

5.2 One-Hot Document Representation

The majority of machine learning algorithms take a fixed-length vector as input, so documents need to be represented as vectors. The bag-of-words model is the most common and simplest representation method for documents. Similar to one-hot sentence representation, for a document \(d = \{w_1, w_2, \ldots , w_l\}\), a bag-of-words representation \(\mathbf {d}\) can be used to represent this document. Specifically, for a vocabulary \({V} = [w_1, w_2, \dots , w_{|{V}|}]\), the one-hot representation of word w is \( \mathbf {w} = [0, 0, \dots , 1, \dots , 0]\). Based on the one-hot word representation and the vocabulary V, a document can be represented as

$$\begin{aligned} \mathbf {d} = \sum _{k=1}^{l}\mathbf {w}_{k}, \end{aligned}$$

where l is the length of the document d. And similar to one-hot sentence representation, the TF-IDF method is also proposed to enhance the ability of bag-of-words representation in reflecting how important a word is to a document in a corpus.
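The sum of one-hot vectors above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the toy vocabulary and documents are made up, and the TF-IDF weighting uses idf = log(N / df), which is only one common variant and is an assumption here.

```python
import math

def one_hot(word, vocab):
    """One-hot vector of `word` over the ordered vocabulary `vocab`."""
    return [1 if w == word else 0 for w in vocab]

def bag_of_words(doc, vocab):
    """Sum the one-hot vectors of all tokens in `doc` (a list of words)."""
    d = [0] * len(vocab)
    for token in doc:
        for j, bit in enumerate(one_hot(token, vocab)):
            d[j] += bit
    return d

def tf_idf(docs, vocab):
    """Weight raw counts by idf = log(N / df) (assumed variant)."""
    n_docs = len(docs)
    counts = [bag_of_words(doc, vocab) for doc in docs]
    # Document frequency: number of documents containing each vocabulary word.
    df = [sum(1 for c in counts if c[j] > 0) for j in range(len(vocab))]
    idf = [math.log(n_docs / df_j) if df_j > 0 else 0.0 for df_j in df]
    return [[c[j] * idf[j] for j in range(len(vocab))] for c in counts]

# Hypothetical toy data.
vocab = ["cat", "dog", "sat", "the"]
doc_a = ["the", "cat", "sat"]
doc_b = ["sat", "the", "cat"]          # same words as doc_a, different order
doc_c = ["the", "dog", "sat", "the"]

vec_a = bag_of_words(doc_a, vocab)
vec_b = bag_of_words(doc_b, vocab)
weights = tf_idf([doc_a, doc_b, doc_c], vocab)
```

Here `vec_a` and `vec_b` are identical even though the word order differs, and a word that occurs in every document (such as "the") receives zero TF-IDF weight under this variant.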

In practice, the bag-of-words representation is mainly used as a tool for feature generation, and the most common type of feature calculated with this method is the frequency of words appearing in the documents. This method is simple but efficient and can sometimes reach excellent performance in real-world applications. However, the bag-of-words representation entirely ignores word order, which means different documents can have the same representation as long as the same words are used. Furthermore, the bag-of-words representation captures little about the semantics of the words or, more formally, the distances between words, which means this method cannot utilize the rich information hidden in word representations.

5.3 Topic Model

As our collective knowledge continues to be digitized and stored in the form of news, blogs, web pages, scientific articles, books, images, audio, videos, and social networks, it becomes more difficult to find and discover what we are looking for. We need new computational tools to help organize, search, and understand these vast amounts of information.

Right now, we work with online information using two main tools—search and links. We type keywords into a search engine and find a set of documents related to them. We look at the documents in that set, possibly navigating to other linked documents. This is a powerful way of interacting with our online archive, but something is missing.

Imagine searching and exploring documents based on the themes that run through them. We might “zoom in” and “zoom out” to find specific or broader themes; we might look at how those themes changed through time or how they are connected. Rather than finding documents through keyword search alone, we might first find the theme that we are interested in, and then examine the documents related to that theme.

For example, consider using themes to explore the complete history of the New York Times. At a broad level, some of the themes might correspond to the sections of the newspaper, such as foreign policy, national affairs, and sports. We could zoom in on a theme of interest, such as foreign policy, to reveal various aspects of it, such as Chinese foreign policy, the conflict in the Middle East, and the United States’ relationship with Russia. We could then navigate through time to reveal how these specific themes have changed, tracking, for example, the changes in the conflict in the Middle East over the last 50 years. And, in all of this exploration, we would be pointed to the original articles relevant to the themes. The thematic structure would be a new kind of window through which to explore and digest the collection.

But we do not interact with electronic archives in this way. While more and more texts are available online, we do not have the human power to read and study them to provide the kind of browsing experience described above. To this end, machine learning researchers have developed probabilistic topic modeling, a suite of algorithms that aim to discover and annotate vast archives of documents with thematic information. Topic modeling algorithms are statistical methods that analyze the words of the original texts to explore the themes that run through them, how those themes are connected, and how they change over time. Topic modeling algorithms do not require any prior annotations or labeling of the documents. The topics emerge from the analysis of the original texts. Topic modeling enables us to organize and summarize electronic archives at a scale that would be impossible by human annotation.

5.3.1 Latent Dirichlet Allocation

A variety of probabilistic topic models have been used to analyze the content of documents and the meaning of words. Hofmann first introduced the probabilistic topic approach to document modeling in his Probabilistic Latent Semantic Indexing method (pLSI). The pLSI model does not make any assumptions about how the mixture weights are generated, making it difficult to test the generalization ability of the model to new documents. Latent Dirichlet Allocation (LDA) extends this model by placing a Dirichlet prior on the topic mixture weights. LDA is widely regarded as a simple but effective topic model. We first describe the basic ideas of LDA [6].

The intuition behind LDA is that documents exhibit multiple topics. LDA is a statistical model of document collections that tries to capture this intuition. It is most easily described by its generative process, the imaginary random process by which the model assumes the documents arose.

We formally define a topic to be a distribution over a fixed vocabulary. We assume that these topics are specified before any data has been generated. Now for each document in the collection, we generate the words in a two-stage process.

  1. Randomly choose a distribution over topics.

  2. For each word in the document,

    a. Randomly choose a topic from the distribution over topics in step #1.

    b. Randomly choose a word from the corresponding distribution over the vocabulary.

This statistical model reflects the intuition that documents exhibit multiple topics. Each document exhibits the topics with different proportions (step #1); each word in each document is drawn from one of the topics (step #2b), where the selected topic is chosen from the per-document distribution over topics (step #2a).
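The two-stage process above can be simulated directly. The following is a minimal sketch using NumPy; the fixed document length, the toy topics, and the uniform Dirichlet parameter are illustrative assumptions, not values from the text.

```python
import numpy as np

def generate_corpus(n_docs, doc_len, topics, alpha, rng):
    """Sample documents from the LDA generative process.

    topics: (K, V) array, each row a distribution over the vocabulary.
    alpha:  length-K Dirichlet parameter for the per-document topic
            proportions (values used below are illustrative).
    doc_len is fixed for simplicity (an assumption of this sketch).
    """
    n_topics, vocab_size = topics.shape
    docs, assignments, proportions = [], [], []
    for _ in range(n_docs):
        # Step 1: choose a distribution over topics for this document.
        theta = rng.dirichlet(alpha)
        # Step 2a: choose a topic for every word position.
        z = rng.choice(n_topics, size=doc_len, p=theta)
        # Step 2b: choose each word from its topic's vocabulary distribution.
        w = np.array([rng.choice(vocab_size, p=topics[k]) for k in z])
        docs.append(w)
        assignments.append(z)
        proportions.append(theta)
    return docs, assignments, proportions

rng = np.random.default_rng(0)
# Two hypothetical topics over a four-word vocabulary.
topics = np.array([[0.5, 0.5, 0.0, 0.0],
                   [0.0, 0.0, 0.5, 0.5]])
docs, z, theta = generate_corpus(3, 10, topics, alpha=np.ones(2), rng=rng)
```

Each generated document mixes words from both topics in proportions given by its own `theta`, which is the intuition the generative process is meant to capture.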

We emphasize that the algorithms have no information about these subjects and the articles are not labeled with topics or keywords. The interpretable topic distributions arise by computing the hidden structure that likely generated the observed collection of documents.

LDA and Probabilistic Models

LDA and other topic models are part of the broader field of probabilistic modeling. In generative probabilistic modeling, we treat our data as arising from a generative process that includes hidden variables. This generative process defines a joint probability distribution over both the observed and hidden random variables. Given the observed variables, we perform data analysis by using that joint distribution to compute the conditional distribution of the hidden variables. This conditional distribution is also called the posterior distribution.

LDA falls precisely into this framework. The observed variables are the words of the documents, the hidden variables are the topic structure, and the generative process is as described above. The computational problem of inferring the hidden topic structure from the documents is the problem of computing the posterior distribution, the conditional distribution of the hidden variables given the documents.

We can describe LDA more formally with the following notation. The topics are \(\beta _{1:K}\), where each \(\beta _k\) is a distribution over the vocabulary. The topic proportions for the dth document are \(\theta _d\), where \(\theta _{dk}\) is the topic proportion for topic k in document d. The topic assignments for the dth document are \(z_d\), where \(z_{d,n}\) is the topic assignment for the nth word in document d. Finally, the observed words for document d are \(w_d\), where \(w_{d,n}\) is the nth word in document d, which is an element from the fixed vocabulary.

With this notation, the generative process for LDA corresponds to the following joint distribution of the hidden and observed variables:

$$\begin{aligned} P(\beta _{1:K},\theta _{1:D},z_{1:D},w_{1:D}) = \prod _{i=1}^K P(\beta _i) \prod _{d=1}^D P(\theta _d) \left( \prod _{n=1}^N P(z_{d,n}|\theta _d)P(w_{d,n}|\beta _{1:K},z_{d,n})\right) . \end{aligned}$$

Notice that this distribution specifies a number of dependencies. For example, the topic assignment \(z_{d,n}\) depends on the per-document topic proportions \(\theta _d\). As another example, the observed word \(w_{d,n}\) depends on the topic assignment \(z_{d,n}\) and all of the topics \(\beta _{1:K}\).

These dependencies define LDA. They are encoded in the statistical assumptions behind the generative process, in the particular mathematical form of the joint distribution, and, in a third way, in the probabilistic graphical model for LDA. Probabilistic graphical models provide a graphical language for describing families of probability distributions. The graphical model for LDA is shown in Fig. 5.1. Each node is a random variable and is labeled according to its role in the generative process. The hidden nodes (the topic proportions, assignments, and topics) are unshaded. The observed nodes (the words of the documents) are shaded. We use rectangles as plate notation to denote replication. The N plate denotes the collection of words within a document; the D plate denotes the collection of documents within the corpus. These three representations are equivalent ways of describing the probabilistic assumptions behind LDA.

Fig. 5.1 The architecture of the graphical model for Latent Dirichlet Allocation

Posterior Computation for LDA

We now turn to the computational problem, computing the conditional distribution of the topic structure given the observed documents. (As we mentioned above, this is called the posterior.) Using our notation, the posterior is

$$\begin{aligned} P(\beta _{1:K},\theta _{1:D},z_{1:D}|w_{1:D}) = \frac{P(\beta _{1:K},\theta _{1:D},z_{1:D},w_{1:D})}{P(w_{1:D})}. \end{aligned}$$

The numerator is the joint distribution of all the random variables, which can be easily computed for any setting of the hidden variables. The denominator is the marginal probability of the observations, which is the probability of seeing the observed corpus under any topic model. In theory, it can be computed by summing the joint distribution over every possible instantiation of the hidden topic structure.

Topic modeling algorithms form an approximation of the above equation by forming an alternative distribution over the latent topic structure that is adapted to be close to the true posterior. Topic modeling algorithms generally fall into two categories: sampling-based algorithms and variational algorithms.

Sampling-based algorithms attempt to collect samples from the posterior in order to approximate it with an empirical distribution. The most commonly used sampling algorithm for topic modeling is Gibbs sampling, where we construct a Markov chain (a sequence of random variables, each dependent on the previous one) whose limiting distribution is the posterior. The Markov chain is defined on the hidden topic variables for a particular corpus, and the algorithm runs the chain for a long time, collects samples from the limiting distribution, and then approximates the posterior with the collected samples.
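As a concrete illustration of a sampling-based algorithm, the sketch below implements a small collapsed Gibbs sampler for LDA, in which the topic proportions and topics are integrated out and only the topic assignments are resampled. The collapsed variant, the symmetric hyperparameters `alpha` and `beta`, and the toy corpus are assumptions of this example; it is not tuned for speed.

```python
import numpy as np

def collapsed_gibbs(docs, n_topics, vocab_size, alpha=0.1, beta=0.01,
                    n_iters=50, seed=0):
    """Resample every topic assignment in turn, given all the others.

    docs: list of documents, each a list of integer word ids.
    Returns the assignments and the document-topic / word-topic counts.
    """
    rng = np.random.default_rng(seed)
    c_dk = np.zeros((len(docs), n_topics))   # topic counts per document
    c_wk = np.zeros((vocab_size, n_topics))  # topic counts per word type
    c_k = np.zeros(n_topics)                 # total tokens per topic
    z = [rng.integers(n_topics, size=len(doc)) for doc in docs]
    for d, doc in enumerate(docs):
        for n, w in enumerate(doc):
            k = z[d][n]
            c_dk[d, k] += 1
            c_wk[w, k] += 1
            c_k[k] += 1
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                # Remove the current token from the counts.
                k = z[d][n]
                c_dk[d, k] -= 1
                c_wk[w, k] -= 1
                c_k[k] -= 1
                # Conditional distribution of this token's topic.
                p = (c_dk[d] + alpha) * (c_wk[w] + beta) / (c_k + vocab_size * beta)
                k = rng.choice(n_topics, p=p / p.sum())
                # Add the token back with its new topic (instant update).
                z[d][n] = k
                c_dk[d, k] += 1
                c_wk[w, k] += 1
                c_k[k] += 1
    return z, c_dk, c_wk

# Hypothetical toy corpus over a four-word vocabulary.
toy_docs = [[0, 1, 0, 1], [2, 3, 2, 3, 3], [0, 1, 2]]
z_gibbs, c_dk, c_wk = collapsed_gibbs(toy_docs, n_topics=2, vocab_size=4, n_iters=10)
```

After enough sweeps, normalizing the rows of `c_dk + alpha` gives an estimate of each document's topic proportions.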

Variational methods are a deterministic alternative to sampling-based algorithms. Rather than approximating the posterior with samples, variational methods posit a parameterized family of distributions over the hidden structure and then find the member of that family that is closest to the posterior. Thus, the inference problem is transformed into an optimization problem. Variational methods open the door for innovations in optimization to have a practical impact on probabilistic modeling.

5.3.2 Extensions

The simple LDA model provides a powerful tool for discovering and exploiting the hidden thematic structure in large archives of text. However, one of the main advantages of formulating LDA as a probabilistic model is that it can easily be used as a module in more complicated models for more complex goals. Since its introduction, LDA has been extended and adapted in many ways.

Relaxing the Assumptions of LDA

LDA is defined by the statistical assumptions it makes about the corpus. One active area of topic modeling research is how to relax and extend these assumptions to uncover a more sophisticated structure in the texts.

One assumption that LDA makes is the bag-of-words assumption that the order of the words in the document does not matter. While this assumption is unrealistic, it is reasonable if our only goal is to uncover the coarse semantic structure of the texts. For more sophisticated goals, such as language generation, it is patently not appropriate. There have been many extensions to LDA that model words as non-exchangeable. For example, [59] develops a topic model that relaxes the bag-of-words assumption by assuming that the topics generate words conditional on the previous word; [22] develops a topic model that switches between LDA and a standard HMM. These models expand the parameter space significantly but show improved language modeling performance.

Another assumption is that the order of documents does not matter. Again, this can be seen by noticing that Eq. 5.3 remains invariant to permutations of the ordering of documents in the collection. This assumption may be unrealistic when analyzing long-running collections that span years or centuries. In such collections, we may want to assume that the topics change over time. One approach to this problem is the dynamic topic model [5], a model that respects the ordering of the documents and gives a richer posterior topical structure than LDA.

The third assumption about LDA is that the number of topics is assumed known and fixed. The Bayesian nonparametric topic model provides an elegant solution: The collection determines the number of topics during posterior inference, and new documents can exhibit previously unseen topics. Bayesian nonparametric topic models have been extended to hierarchies of topics, which find a tree of topics, moving from more general to more concrete, whose particular structure is inferred from the data [4].

Incorporating Meta-Data to LDA

In many text analysis settings, the documents contain additional information such as author, title, geographic location, links, and others that we might want to account for when fitting a topic model. There has been a flurry of research on adapting topic models to include meta-data.

The author-topic model [51] is an early success story for this kind of research. The topic proportions are attached to authors; papers with multiple authors are assumed to attach each word to an author, drawn from a topic drawn from his or her topic proportions. The author-topic model allows for inferences about authors as well as documents.

Many document collections are linked. For example, scientific papers are linked by citations, or web pages are connected by hyperlinks. And several topic models have been developed to account for those links when estimating the topics. The relational topic model of [9] assumes that each document is modeled as in LDA and that the links between documents depend on the distance between their topic proportions. This is both a new topic model and a new network model. Unlike traditional statistical models of networks, the relational topic model takes into account node attributes in modeling the links.

Other work that incorporates meta-data into topic models includes models of linguistic structure [8], models that account for distances between corpora [60], and models of named entities [42]. General-purpose methods for incorporating meta-data into topic models include Dirichlet-multinomial regression models [39] and supervised topic models [37].

Acceleration

Many algorithms have been proposed to accelerate sampling for LDA. In the existing fast algorithms based on collapsed Gibbs sampling (CGS), it is difficult to decouple the accesses to the document-topic counts \( C_{d}\) and the word-topic counts \( C_{w}\) because both counts need to be updated instantly after the sampling of every token. WarpLDA [13] is built on a new Monte Carlo Expectation Maximization (MCEM) algorithm, which is similar to CGS, except that both counts are fixed until the sampling of all tokens is finished. This scheme can be used to develop a reordering strategy that decouples the accesses to \( C_d\) and \( C_w\) and minimizes the size of randomly accessed memory.

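Specifically, WarpLDA seeks a MAP solution of the latent variables \( \varTheta \) and \( \varPhi \), with the latent topic assignments Z integrated out:

$$\begin{aligned} \hat{\varTheta }, \hat{\varPhi } = \mathop {\mathrm {argmax}}_{\varTheta , \varPhi } \log P(\varTheta , \varPhi | W, \alpha ^\prime , \beta ^\prime ), \end{aligned}$$

where \(\alpha ^\prime \) and \(\beta ^\prime \) are the Dirichlet hyperparameters. Reference [2] has shown that this MAP solution is almost identical to the solution of CGS, given proper hyperparameters.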

Computing \(\log P(\varTheta , \varPhi | W, \alpha ', \beta ')\) directly is expensive because it requires enumerating all K possible topic assignments for each token. We therefore optimize a lower bound as a surrogate. Let Q(Z) be a variational distribution. Then, by Jensen’s inequality, we obtain the lower bound \(\mathscr {J}( \varTheta , \varPhi , Q( Z))\):

$$\begin{aligned} \log P(\varTheta , \varPhi | W, \alpha ', \beta ')\ge&\mathbb {E}_Q[\log P( W, Z | \varTheta , \varPhi ) - \log Q( Z)] + \log P( \varTheta | \alpha ') + \log P( \varPhi | \beta ') \nonumber \\ \triangleq&\mathscr {J}( \varTheta , \varPhi , Q( Z)). \end{aligned}$$

An Expectation Maximization (EM) algorithm is implemented to find a local maximum of the posterior \(P( \varTheta , \varPhi | W, \alpha ^{\prime }, \beta ^{\prime })\), where the E-step maximizes \(\mathscr {J}\) with respect to the variational distribution Q(Z) and the M-step maximizes \(\mathscr {J}\) with respect to the model parameters \(( \varTheta , \varPhi )\), while keeping Q(Z) fixed. One can prove that the optimal solution at E-step is \(Q( Z) = P( Z | W, \varTheta , \varPhi )\) without further assumption on Q. We apply Monte Carlo approximation on the expectation in Eq. 5.4,

$$\begin{aligned} \mathbb {E}_Q[\log P( W, Z | \varTheta , \varPhi ) - \log Q( Z)]\approx \frac{1}{S}\sum _{s=1}^S\left[ \log P( W, Z^{(s)} | \varTheta , \varPhi ) - \log Q( Z^{(s)})\right] , \end{aligned}$$

where \( Z^{(1)}, \dots , Z^{(S)} \sim Q( Z)=P( Z | W, \varTheta , \varPhi ).\) The sample size is set as \(S=1\) and the model uses Z as an abbreviation of \( Z^{(1)}\).

Sampling Z: Each dimension of Z can be sampled independently:

$$\begin{aligned} Q(z_{d, n}=k)\propto P( W, Z | \varTheta , \varPhi ) \propto \theta _{dk} \phi _{w_{d, n}, k}. \end{aligned}$$

Optimizing \( \varTheta , \varPhi \): With the Monte Carlo approximation, we have

$$\begin{aligned} \mathscr {J}&\approx \log P( W, Z | \varTheta , \varPhi ) + \log P( \varTheta | \alpha ^\prime ) + \log P( \varPhi | \beta ^\prime ) + \text{ const. } \nonumber \\&= \sum _{d, k} (C_{dk}+\alpha ^\prime _k-1)\log \theta _{dk} + \sum _{w, k}(C_{wk}+\beta ^\prime -1)\log \phi _{wk} + \text{ const. }, \end{aligned}$$

and with the optimal solutions, we have

$$\begin{aligned} \hat{\theta }_{dk} \propto C_{dk} + \alpha ^\prime _k - 1, ~~~~ \hat{\phi }_{wk} = \frac{C_{wk} + \beta ^\prime - 1}{C_k + \bar{\beta }^\prime - V}. \end{aligned}$$

Instead of computing and storing \(\hat{ \varTheta }\) and \(\hat{ \varPhi }\), we compute and store \( C_{d}\) and \( C_{w}\) to save memory because the latter are sparse. Plugging Eq. 5.8 into Eq. 5.6 and letting \( \alpha = \alpha ^\prime -1, \beta =\beta ^\prime -1\), we get the full MCEM algorithm, which iteratively performs the following two steps until a given iteration number is reached:

  • E-step: We can sample \(z_{d, n}\sim Q(z_{d, n}=k)\) according to

    $$\begin{aligned} Q(z_{d, n}=k)\propto (C_{dk} + \alpha _k)\frac{C_{wk} + \beta _w}{C_k + \bar{\beta }} . \end{aligned}$$
  • M-step: Compute \( C_{d}\) and \( C_{w}\) by Z.

Note that the resemblance between this E-step and the CGS update intuitively justifies why MCEM leads to results similar to those of CGS. The difference between MCEM and CGS is that MCEM updates the counts \( C_d\) and \( C_w\) after sampling all \(z_{d, n}\), while CGS updates the counts instantly after sampling each \(z_{d, n}\). The strategy of updating the counts only after sampling all \(z_{d, n}\) is called delayed count update, or simply delayed update. MCEM can be viewed as CGS with a delayed update, a technique that has been widely used in other algorithms [1, 41]. While previous work uses the delayed update as a trick, the MCEM formulation comes with a theoretical guarantee of convergence to a MAP solution. The delayed update is essential for decoupling the accesses to \( C_d\) and \( C_w\) to improve cache locality without affecting correctness.
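The E-step and M-step above can be sketched as a short loop. This is only an illustration of MCEM with a delayed count update, assuming symmetric scalar hyperparameters `alpha` and `beta` (so that the normalizer is `C_k + V * beta`) and a toy corpus; it does not include the reordering and memory-access optimizations of WarpLDA itself.

```python
import numpy as np

def count_topics(docs, z, n_topics, vocab_size):
    """M-step: recompute C_d (docs x topics) and C_w (words x topics) from Z."""
    c_d = np.zeros((len(docs), n_topics))
    c_w = np.zeros((vocab_size, n_topics))
    for d, doc in enumerate(docs):
        for w, k in zip(doc, z[d]):
            c_d[d, k] += 1
            c_w[w, k] += 1
    return c_d, c_w

def mcem(docs, n_topics, vocab_size, alpha=0.1, beta=0.01, n_iters=20, seed=0):
    """MCEM for LDA with one Monte Carlo sample per iteration (S = 1)."""
    rng = np.random.default_rng(seed)
    z = [rng.integers(n_topics, size=len(doc)) for doc in docs]
    c_d, c_w = count_topics(docs, z, n_topics, vocab_size)
    for _ in range(n_iters):
        c_k = c_w.sum(axis=0)
        # E-step: the counts stay frozen while every token is resampled.
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                q = (c_d[d] + alpha) * (c_w[w] + beta) / (c_k + vocab_size * beta)
                z[d][n] = rng.choice(n_topics, p=q / q.sum())
        # M-step: delayed update, counts are rebuilt once per iteration.
        c_d, c_w = count_topics(docs, z, n_topics, vocab_size)
    return z, c_d, c_w

# Hypothetical toy corpus over a four-word vocabulary.
mcem_docs = [[0, 1, 0, 1], [2, 3, 2, 3, 3], [0, 1, 2]]
z_mcem, c_d, c_w = mcem(mcem_docs, n_topics=2, vocab_size=4)
```

Because the counts are read-only during the E-step, the tokens could be visited in any order, which is what makes the decoupled access pattern possible.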

5.4 Distributed Document Representation

To address the disadvantages of bag-of-words document representation, [31] proposes paragraph vector models, including the version with Distributed Memory (PV-DM) and the version with Distributed Bag-of-Words (PV-DBOW). Moreover, researchers also proposed several hierarchical neural network models to represent documents. In this section, we will introduce these models in detail.

5.4.1 Paragraph Vector

As shown in Fig. 5.2, paragraph vector maps every paragraph to a unique vector, represented by a column in the matrix \(\mathbf {P}\) and maps every word to a unique vector, represented by a column in word embedding matrix \(\mathbf {E}\). The paragraph vector and word vectors are averaged or concatenated to predict the next word in a context. More formally, compared to the word vector framework, the only change in this model is in the following equation, where h is constructed from \(\mathbf {E}\) and \(\mathbf {P}\).

$$\begin{aligned} y = {\text {Softmax}}(h(w_{t-k}, \ldots , w_{t+k}; \mathbf {E}, \mathbf {P})), \end{aligned}$$

where h is constructed by concatenating or averaging the paragraph vector extracted from \(\mathbf {P}\) and the word vectors extracted from \(\mathbf {E}\).

Given a sequence of training words \(w_1\), \(w_2\), \(w_3\), ..., \(w_l\), the objective of the paragraph vector model is to maximize the average log probability:

$$\begin{aligned} \mathscr {O} = \frac{1}{l}\sum _{i=k}^{l-k}\log P(w_i \mid w_{i-k}, \ldots , w_{i+k}). \end{aligned}$$

And the prediction task is typically done via a multi-class classifier, such as softmax. Thus, the probability equation is

$$\begin{aligned} P(w_i \mid w_{i-k}, \ldots , w_{i+k}) = \frac{e^{y_{w_i}}}{\sum _{j}e^{y_j}}. \end{aligned}$$

The paragraph token can be thought of as another word. It acts as a memory that remembers what is missing from the current context, or the topic of the paragraph. For this reason, this model is often called the Distributed Memory Model of Paragraph Vectors (PV-DM).
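A forward pass of PV-DM can be sketched as follows. The softmax parameters `U` and `b`, the dimensions, and the random initialization are hypothetical choices made for this illustration; training (for example, by stochastic gradient descent on the objective above) is not shown.

```python
import numpy as np

def softmax(y):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(y - y.max())
    return e / e.sum()

def pv_dm_predict(paragraph_id, context_ids, P, E, U, b, combine="average"):
    """Predict a distribution over the vocabulary for the target word.

    P: (dim, n_paragraphs) paragraph matrix, one column per paragraph.
    E: (dim, vocab_size) word embedding matrix, one column per word.
    U, b: softmax weights and bias (names assumed for this sketch).
    """
    vectors = [P[:, paragraph_id]] + [E[:, i] for i in context_ids]
    if combine == "average":
        h = np.mean(vectors, axis=0)
    else:  # concatenation; U must then have dim * (1 + context size) columns
        h = np.concatenate(vectors)
    return softmax(U @ h + b)

rng = np.random.default_rng(0)
dim, vocab_size, n_paragraphs = 4, 6, 2
P = rng.normal(size=(dim, n_paragraphs))
E = rng.normal(size=(dim, vocab_size))
b = np.zeros(vocab_size)
context = [1, 2, 4, 5]

U_avg = rng.normal(size=(vocab_size, dim))
probs_avg = pv_dm_predict(0, context, P, E, U_avg, b, combine="average")

U_cat = rng.normal(size=(vocab_size, dim * (1 + len(context))))
probs_cat = pv_dm_predict(0, context, P, E, U_cat, b, combine="concat")
```

The same context words paired with a different `paragraph_id` yield a different `h`, which is how the paragraph vector acts as a memory of what the local context is missing.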

Fig. 5.2 The architecture of the PV-DM model

The above method considers the concatenation of the paragraph vector with the word vectors to predict the next word in a text window. Another way is to ignore the context words in the input, but force the model to predict words randomly sampled from the paragraph in the output. In reality, what this means is that at each iteration of stochastic gradient descent, we sample a text window, then sample a random word from the text window and form a classification task given the Paragraph Vector. This technique is shown in Fig. 5.3. This version is named the Distributed Bag-of-Words version of Paragraph Vector (PV-DBOW), as opposed to the Distributed Memory version of Paragraph Vector (PV-DM) in the previous section.

Fig. 5.3 The architecture of the PV-DBOW model

In addition to being conceptually simple, this model requires storing less data. Only the softmax weights need to be stored, as opposed to both the softmax weights and the word vectors in the previous model. This model is also similar to the Skip-gram model for word vectors.
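One stochastic gradient step of PV-DBOW can be sketched as below. The example uses a full softmax and a fixed learning rate, and it takes the sampled target word as an argument; the parameter names `U` and `b`, the dimensions, and the initialization are assumptions of this sketch, and practical implementations typically replace the full softmax with a cheaper approximation.

```python
import numpy as np

def pv_dbow_step(p_vec, target_id, U, b, lr=0.1):
    """One SGD step on -log P(target word | paragraph vector).

    p_vec: paragraph vector (updated in place).
    U, b:  softmax weights (vocab_size, dim) and bias, updated in place.
    Returns the loss measured before the update.
    """
    logits = U @ p_vec + b
    logits -= logits.max()
    probs = np.exp(logits) / np.exp(logits).sum()
    loss = -np.log(probs[target_id])
    # Gradient of the loss with respect to the logits: probs - one_hot(target).
    grad_logits = probs.copy()
    grad_logits[target_id] -= 1.0
    grad_p = U.T @ grad_logits          # computed before U is modified
    U -= lr * np.outer(grad_logits, p_vec)
    b -= lr * grad_logits
    p_vec -= lr * grad_p
    return loss

rng = np.random.default_rng(0)
dim, vocab_size, n_paragraphs = 8, 5, 3
P_dbow = rng.normal(0.0, 0.1, (dim, n_paragraphs))
U_dbow = rng.normal(0.0, 0.1, (vocab_size, dim))
b_dbow = np.zeros(vocab_size)

# Repeatedly predict word 2 from paragraph 0; P_dbow[:, 0] is a view,
# so the in-place update also changes the paragraph matrix.
losses = [pv_dbow_step(P_dbow[:, 0], 2, U_dbow, b_dbow) for _ in range(50)]
```

Note that no word embedding matrix appears anywhere in the step, which reflects the smaller storage requirement described above.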

5.4.2 Neural Document Representation

In this part, we introduce two main kinds of neural networks for document representation including document-context language model and hierarchical document autoencoder.

Document-Context Language Model

Recurrent architectures can be used to combine local and global information in document language modeling. The simplest such model would be to train a single RNN, ignoring sentence boundaries as mentioned above; the last hidden state from the previous sentence \(t-1\) is used to initialize the first hidden state in sentence t. In such an architecture, the length of the RNN is equal to the number of tokens in the document; in typical genres such as news texts, this means training RNNs on sequences of several hundred tokens, which introduces two problems. (1) Information decay: in a sentence with thirty tokens (not unusual in news text), the contextual information from the previous sentence must be propagated through the recurrent dynamics thirty times before it can reach the last token of the current sentence. Meaningful document-level information is unlikely to survive such a long pipeline. (2) Learning: it is notoriously difficult to train recurrent architectures that involve many time steps. In the case of an RNN trained on an entire document, back-propagation would have to run over hundreds of steps, posing severe numerical challenges.

To address these two issues, [28] proposes to use multilevel recurrent structures to represent documents, thereby efficiently leveraging document-level context in language modeling. They first proposed the Context-to-Context Document-Context Language Model (ccDCLM), which assumes that contextual information from previous sentences needs to be able to “short-circuit” the standard RNN, so as to more directly impact the generation of words across longer spans of text. Formally, we have

$$\begin{aligned} \mathbf {c}_{t-1} = \mathbf {h}_{t-1,l}, \end{aligned}$$

where l is the length of sentence \(t-1\). The ccDCLM model then creates additional paths for this information to impact each hidden representation in the current sentence t. Writing \(\mathbf {x}_{t,n}\) for the word representation of the nth word in the tth sentence, we have

$$\begin{aligned} \mathbf {h}_{t,n} = {g}_{\theta }(\mathbf {h}_{t,n-1},{f}(\mathbf {x}_{t,n},\mathbf {c}_{t-1})), \end{aligned}$$

where \({g}_{\theta }(\cdot )\) is the activation function parameterized by \(\theta \) and \({f}(\cdot )\) is a function that combines the context vector with the input \(\mathbf {x}_{t,n}\) for the hidden state. Here we simply concatenate the representations,

$$\begin{aligned} {f}(\mathbf {x}_{t,n},\mathbf {c}_{t-1})= [\mathbf {x}_{t,n};\mathbf {c}_{t-1}]. \end{aligned}$$

The emission probability for \(\mathbf {y}_{t,n}\) is then computed from \(\mathbf {h}_{t,n}\) as in the standard RNNLM. The underlying assumption of this model is that contextual information should impact the generation of each word in the current sentence. The model, therefore, introduces computational “short-circuits” for cross-sentence information, as illustrated in Fig. 5.4.
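The ccDCLM recurrence can be sketched with a plain tanh RNN cell standing in for the recurrent function. The tanh cell, the zero initial hidden state for each sentence, the zero context before the first sentence, and the parameter names `W_h`, `W_in`, and `b` are all assumptions of this example.

```python
import numpy as np

def cc_dclm_sentence(xs, c_prev, W_h, W_in, b):
    """Hidden states of one sentence, with the context fed at every step.

    xs:     list of word vectors for the current sentence.
    c_prev: context vector, the last hidden state of the previous sentence.
    """
    h = np.zeros(W_h.shape[0])            # assumed initial state
    states = []
    for x in xs:
        f = np.concatenate([x, c_prev])   # combine input and context: [x; c]
        h = np.tanh(W_h @ h + W_in @ f + b)
        states.append(h)
    return states

def cc_dclm_document(sentences, W_h, W_in, b):
    """Run all sentences, passing each final state on as the next context."""
    c_prev = np.zeros(W_h.shape[0])       # no context before the first sentence
    all_states = []
    for xs in sentences:
        states = cc_dclm_sentence(xs, c_prev, W_h, W_in, b)
        all_states.append(states)
        c_prev = states[-1]
    return all_states

rng = np.random.default_rng(0)
emb_dim, hidden_dim = 2, 3
W_h = rng.normal(0.0, 0.5, (hidden_dim, hidden_dim))
W_in = rng.normal(0.0, 0.5, (hidden_dim, emb_dim + hidden_dim))
b = np.zeros(hidden_dim)
# Hypothetical document: two sentences with three and two word vectors.
document = [[rng.normal(size=emb_dim) for _ in range(3)],
            [rng.normal(size=emb_dim) for _ in range(2)]]
cc_states = cc_dclm_document(document, W_h, W_in, b)
```

Every hidden state of the second sentence sees the context vector directly, rather than only through the initial state, which is the short-circuit described above.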

Fig. 5.4 The architecture of the ccDCLM model

They also propose the Context-to-Output Document-Context Language Model (coDCLM). Rather than incorporating the document context into the recurrent definition of the hidden state, the coDCLM model pushes it directly to the output, as illustrated in Fig. 5.5. Let \(\mathbf {h}_{t,n}\) be the hidden state from a conventional RNNLM of sentence t,

$$\begin{aligned} \mathbf {h}_{t,n} = {g}_{\theta }(\mathbf {h}_{t,n-1}, \mathbf {x}_{t,n}). \end{aligned}$$

Then, the context vector \(\mathbf {c}_{t-1}\) is directly used in the output layer as

$$\begin{aligned} \mathbf {y}_{t,n}\sim {\text {Softmax}}(\mathbf {W}_{h}\mathbf {h}_{t,n} + \mathbf {W}_{c}\mathbf {c}_{t-1} + \mathbf {b}). \end{aligned}$$
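A minimal sketch of the coDCLM output layer follows, assuming the hidden state and the context vector are already computed. The function name and the toy dimensions are illustrative, not taken from [28].

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # subtract the max for numerical stability
    return e / e.sum()

def codclm_word_distribution(h, c_prev, W_h, W_c, b):
    """P(y_{t,n}) = Softmax(W_h h_{t,n} + W_c c_{t-1} + b)."""
    return softmax(W_h @ h + W_c @ c_prev + b)

rng = np.random.default_rng(0)
vocab, d_h = 6, 3
p = codclm_word_distribution(
    rng.normal(size=d_h), rng.normal(size=d_h),
    rng.normal(size=(vocab, d_h)), rng.normal(size=(vocab, d_h)),
    np.zeros(vocab),
)
```

Here the context only shifts the output logits, so the recurrent state itself stays a purely sentence-level quantity.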

Fig. 5.5 The architecture of the coDCLM model

Hierarchical Document Autoencoder

Reference [33] proposes a hierarchical document autoencoder to represent documents. The model draws on the intuition that just as the juxtaposition of words creates the joint meaning of a sentence, the juxtaposition of sentences creates the joint meaning of a paragraph or a document.

Fig. 5.6 The architecture of the hierarchical document autoencoder

They first obtain sentence-level representation vectors by placing one LSTM layer (denoted as \({\text {LSTM}}_{{encode}}^{{word}}\)) on top of the words each sentence contains:

$$\begin{aligned} \begin{aligned}&h_t^w(\text {enc})={\text {LSTM}}_{{encode}}^{{word}}(\mathbf {w}_t,h_{t-1}^w(\text {enc})). \end{aligned} \end{aligned}$$

The vector output at the ending time step is used to represent the entire sentence as

$$\begin{aligned} \begin{aligned} \mathbf {s}=h_{{end_s}}^w. \end{aligned} \end{aligned}$$

To build representation \(e_D\) for the current document/paragraph, another layer of LSTM (denoted as \({\text {LSTM}}_{{encode}}^{{sentence}}\)) is placed on top of all sentences, computing representations sequentially for each time step:

$$\begin{aligned} \begin{aligned}&h_t^s(\text {enc})={\text {LSTM}}_{{encode}}^{{sentence}}(\mathbf {s}_t,h_{t-1}^s(\text {enc})). \end{aligned} \end{aligned}$$

The representation \(h_{{end_D}}^s\) computed at the final time step is used to represent the entire document: \(e_D=\mathbf {d}=h_{{end_D}}^s\).

Thus one LSTM operates at the token level, leading to the acquisition of sentence-level representations that are then used as inputs into the second LSTM that acquires document-level representations, in a hierarchical structure.
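The two-level encoding can be illustrated with a short NumPy sketch. To keep it compact, a plain tanh RNN stands in for each LSTM (an assumption, not the cell used in [33]), and all names, dimensions, and weights are hypothetical.

```python
import numpy as np

def rnn_last_state(inputs, W_x, W_h):
    """Final hidden state of a tanh RNN (a simplified stand-in for an LSTM)."""
    h = np.zeros(W_h.shape[0])
    for x in inputs:
        h = np.tanh(W_x @ x + W_h @ h)
    return h

def encode_document(doc, word_params, sent_params):
    """doc: list of sentences, each an array of shape (n_words, d_word)."""
    # Word level: one vector s per sentence, taken at its last time step.
    sentence_vecs = [rnn_last_state(sent, *word_params) for sent in doc]
    # Sentence level: the last state over all sentence vectors represents the document.
    return rnn_last_state(sentence_vecs, *sent_params)

rng = np.random.default_rng(0)
d_word, K = 5, 4
word_params = (rng.normal(size=(K, d_word)), rng.normal(size=(K, K)))
sent_params = (rng.normal(size=(K, K)), rng.normal(size=(K, K)))
doc = [rng.normal(size=(n, d_word)) for n in (3, 6, 2)]
d_vec = encode_document(doc, word_params, sent_params)
```

The sentence-level recurrence runs over three steps here rather than eleven tokens, which is the practical benefit of the hierarchy for long documents.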

As with encoding, the decoding algorithm operates on a hierarchical structure with two layers of LSTMs. LSTM outputs at sentence level for time step t are obtained by

$$\begin{aligned} \begin{aligned}&h_{t}^s(\text {dec})={\text {LSTM}}_{{decode}}^{{sentence}}(\mathbf {s}_{t},h_{t-1}^s(\text {dec})). \end{aligned} \end{aligned}$$

The initial state is \(h_0^s(\text {dec})=e_D\), the end-to-end output of the encoding procedure. \(h_{t}^s(\text {dec})\) is used as the initial input to \({\text {LSTM}}_{{decode}}^{word}\) for subsequently predicting the tokens within sentence \(t+1\). \({\text {LSTM}}_{{decode}}^{word}\) predicts tokens at each position sequentially; the embedding of each predicted token is then combined with earlier hidden vectors for the next time-step prediction, until the \(end_s\) token is predicted. The procedure can be summarized as follows:

$$\begin{aligned} h_{t}^w(\text {dec})={\text {LSTM}}_{{decode}}^{{word}}(\mathbf {w}_t,h_{t-1}^w(\text {dec})), \end{aligned}$$
$$\begin{aligned} P(w|\cdot )={\text {Softmax}}(\mathbf {w}, h_{t-1}^w(\text {dec})). \end{aligned}$$

During decoding, \({\text {LSTM}}_{{decode}}^{word}\) generates each word token w sequentially and combines it with earlier LSTM-outputted hidden vectors. The LSTM hidden vector computed at the final time step is used to represent the current sentence.

This is passed to \({\text {LSTM}}_{{decode}}^{sentence}\), combined with \(h_{t}^s(\text {dec})\) to obtain \(h_{t+1}^s(\text {dec})\), and passed on to the next time step of sentence decoding. At each time step t, \({\text {LSTM}}_{{decode}}^{sentence}\) first has to decide whether decoding should proceed or come to a full stop: an additional token \({end}_D\) is added to the vocabulary, and decoding terminates when \({end}_D\) is predicted. Details are shown in Fig. 5.6.

Fig. 5.7 The architecture of the hierarchical document autoencoder with attention

Attention models adopt a look-back strategy by linking the current decoding stage with input sentences in an attempt to consider which part of the input is most responsible for the current decoding state (Fig. 5.7).

Let \(H=\{h_1^s(\text {enc}), h_2^s(\text {enc}), \ldots , h^s_{N}(\text {enc})\}\) be the collection of sentence-level hidden vectors for the N input sentences, output by \({\text {LSTM}}_{{encode}}^{{sentence}}\). Each element in H contains information about the input sequence with a strong focus on the parts surrounding the corresponding sentence (time step). During decoding, suppose that \(e_{t}^s\) denotes the sentence-level embedding at the current step and that \(h_{t-1}^s(\text {dec})\) denotes the hidden vector output by \({\text {LSTM}}_{decode}^{sentence}\) at the previous time step \(t-1\). The attention model first links the current decoding state \(h_{t-1}^s(\text {dec})\) with each input sentence \(i\in [1, N]\) through a strength indicator \(v_i\):

$$\begin{aligned} v_i=\mathbf {U}^\top f(\mathbf {W}_1\cdot h_{t-1}^s(\text {dec})+\mathbf {W}_2\cdot h_i^s(\text {enc})), \end{aligned}$$

where \(\mathbf {W}_1, \mathbf {W}_2\in \mathbb {R}^{K\times K}\), \(\mathbf {U}\in \mathbb {R}^{K\times 1}\). \(v_i\) is then normalized

$$\begin{aligned} \begin{aligned} \alpha _i=\frac{\exp (v_i)}{\sum _{j}\exp (v_j)}. \end{aligned} \end{aligned}$$

The attention vector is then the weighted average of the encoder sentence representations, with weights \(\alpha _i\):

$$\begin{aligned} \mathbf {m}_t=\sum _{i=1}^{N}\alpha _i h_i^s(\text {enc}). \end{aligned}$$
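The three attention steps (scoring, normalization, weighted averaging) can be sketched as follows. The nonlinearity f is assumed to be tanh, \(\mathbf {U}\) is stored as a length-K vector, and all names and sizes are illustrative.

```python
import numpy as np

def sentence_attention(h_dec_prev, H_enc, W1, W2, U):
    """h_dec_prev: (K,) previous decoder state; H_enc: (N, K) encoder sentence states."""
    # v_i = U^T f(W1 h_dec + W2 h_i), with f assumed to be tanh
    v = np.tanh(H_enc @ W2.T + W1 @ h_dec_prev) @ U   # shape (N,)
    alpha = np.exp(v - v.max())
    alpha /= alpha.sum()                               # softmax over input sentences
    m_t = alpha @ H_enc                                # sum_i alpha_i h_i
    return alpha, m_t

rng = np.random.default_rng(0)
N, K = 4, 3
alpha, m_t = sentence_attention(
    rng.normal(size=K), rng.normal(size=(N, K)),
    rng.normal(size=(K, K)), rng.normal(size=(K, K)), rng.normal(size=K),
)
```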

5.5 Applications

In this section, we introduce several applications of document-level analysis based on representation learning.

5.5.1 Neural Information Retrieval

Information retrieval aims to obtain relevant resources from a large-scale collection of information resources. As shown in Fig. 5.8, given the query “Steve Jobs” as input, the search engine (a typical application of information retrieval) provides relevant web pages for users. Traditional information retrieval data consist of search queries and a document collection D. The ground truth is available through explicit human judgments or implicit user behavior data such as click-through rate.

Fig. 5.8 An example of information retrieval

For a given query q and document d, traditional information retrieval models estimate their relevance through lexical matches. Neural information retrieval models focus more on capturing query-document relevance through semantic matches. Both lexical and semantic matches are essential for neural information retrieval. By leveraging neural networks, information retrieval models can capture more sophisticated matching features, and they have achieved state-of-the-art results on the information retrieval task [17].

Current neural ranking models can be categorized into two groups: representation-based and interaction-based [23]. The earlier works mainly focus on representation-based models. They learn good representations and match them in the learned representation space of queries and documents. Interaction-based methods, on the other hand, model the query-document matches from the interactions of their terms.

Representation-Based Neural Ranking Models

The representation-based methods directly match the query and documents by learning two distributed representations, one for each, and then computing the matching score from the similarity between them. In recent years, several deep neural models have been explored based on such a Siamese architecture, which can be built from feedforward layers, convolutional neural networks, or recurrent neural networks.

Reference [26] proposes Deep Structured Semantic Models (DSSM), which first hash words into a letter-trigram-based representation and then use a multilayer fully connected neural network to encode a query (or a document) as a vector. The relevance between the query and the document is then simply the cosine similarity of the two vectors. Reference [26] trains the model by minimizing the cross-entropy loss on click-through data, where each training sample consists of a query q, a positive document \(d^+\), and a uniformly sampled negative document set \(D^-\):

$$\begin{aligned} \mathscr {L}_{DSSM}(q,d^+, D^-)=- \log \left( \frac{e^{r \cdot \text {cos} (\mathbf {q},\mathbf {d^+})}}{\sum _{d \in D} e^{r \cdot \text {cos}(\mathbf {q},\mathbf {d})} }\right) , \end{aligned}$$

where \(D=\{d^+\} \cup D^-\) and r is a scaling (smoothing) factor applied to the cosine similarities inside the softmax.
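The loss is a softmax cross-entropy over scaled cosine similarities, with the positive document as the target class. A minimal sketch, assuming the query and document vectors are already encoded; the default `r=10.0` is an arbitrary illustrative value, not one reported in [26].

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def dssm_loss(q, d_pos, d_negs, r=10.0):
    """Negative log-probability of the positive document under the softmax."""
    scores = r * np.array([cosine(q, d) for d in [d_pos, *d_negs]])
    scores -= scores.max()                      # numerical stability
    return float(-(scores[0] - np.log(np.exp(scores).sum())))

q = np.array([1.0, 0.0])
aligned, orthogonal = np.array([1.0, 0.0]), np.array([0.0, 1.0])
loss_good = dssm_loss(q, aligned, [orthogonal])   # positive matches the query
loss_bad = dssm_loss(q, orthogonal, [aligned])    # positive is the worse match
```

The loss is close to zero when the positive document is the most similar one and grows roughly linearly with r when a negative document outranks it.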

Furthermore, CDSSM [54] and ARC-I [25] utilize convolutional neural network (CNN), while LSTM-RNN [44] adopts recurrent neural network with Long Short-Term Memory (LSTM) units to represent a sentence better. Reference [53] also comes up with a more sophisticated similarity function by leveraging additional layers of the neural network.

Fig. 5.9 The architecture of interaction-based neural ranking models

Interaction-Based Neural Ranking Models

The interaction-based neural ranking models learn word-level interaction patterns from query-document pairs, as shown in Fig. 5.9. They make it possible to compare different parts of the query with different parts of the document individually and to aggregate the partial evidence of relevance. ARC-II [25] and MatchPyramid [45] utilize convolutional neural networks to capture complicated patterns from word-level interactions. The Deep Relevance Matching Model (DRMM) uses pyramid pooling (histogram) to summarize the word-level similarities into ranking features [23]. There are also some works establishing position-dependent interactions for ranking models [27, 46].

Kernel-based Neural Ranking Model (K-NRM) [66] and its convolutional version Conv-KNRM [17] achieve the state of the art in neural information retrieval. K-NRM first establishes a translation matrix \(\mathbf {M}\) in which each element \(\mathbf {M}_{ij}\) is the cosine similarity of ith word in q and jth word in d. Then K-NRM utilizes kernels to convert translation matrix \(\mathbf {M}\) to ranking features \(\phi (\mathbf {M})\) :

$$\begin{aligned} \phi (\mathbf {\mathbf {M}}) = \sum _{i=1}^{n} \log \mathbf {K}(\mathbf {M}_i), \end{aligned}$$
$$\begin{aligned} \mathbf {K}(\mathbf {\mathbf {M}}_i) = \{K_1(\mathbf {M}_i), \ldots , K_K(\mathbf {M}_i)\}. \end{aligned}$$

Each RBF kernel \(K_k\) calculates how word pair similarities are distributed:

$$\begin{aligned} K_k(\mathbf {M}_i) = \sum _{j} \exp \left( - \frac{(\mathbf {M}_{ij}-\mu _k)^2}{2 \sigma _k^2}\right) . \end{aligned}$$

Then the relevance of q and d is calculated by a ranking layer:

$$\begin{aligned} f(q,d)=\tanh (\mathbf {w}^\top \phi (\mathbf {M}) + b), \end{aligned}$$

where \(\mathbf {w}\) and b are trainable parameters.
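The K-NRM scoring pipeline (translation matrix, kernel pooling, ranking layer) can be sketched with NumPy broadcasting. The kernel means and widths below are illustrative choices, and the clipping before the logarithm is an added numerical safeguard, not part of the formulas above.

```python
import numpy as np

def knrm_score(Q, D, mus, sigmas, w, b):
    """Q: (n, dim) query word embeddings; D: (m, dim) document word embeddings."""
    Qn = Q / np.linalg.norm(Q, axis=1, keepdims=True)
    Dn = D / np.linalg.norm(D, axis=1, keepdims=True)
    M = Qn @ Dn.T                                   # translation matrix, (n, m)
    # K_k(M_i): soft count of document words whose similarity to query word i is near mu_k
    K = np.exp(-(M[:, :, None] - mus) ** 2 / (2 * sigmas ** 2)).sum(axis=1)  # (n, K)
    phi = np.log(np.clip(K, 1e-10, None)).sum(axis=0)  # ranking features, (K,)
    return float(np.tanh(w @ phi + b))

rng = np.random.default_rng(0)
mus = np.array([-0.5, 0.0, 0.5, 1.0])               # assumed kernel means
sigmas = np.full(4, 0.1)                            # assumed kernel widths
score = knrm_score(rng.normal(size=(2, 8)), rng.normal(size=(5, 8)),
                   mus, sigmas, rng.normal(size=4), 0.0)
```

Each kernel behaves like a differentiable histogram bin, which is what allows the embeddings to be trained end to end from the ranking loss.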

Reference  [66] trains the model by minimizing pair-wise loss on click-through data:

$$\begin{aligned} \mathscr {L} = \sum _{q} \sum _{d^+,d^- \in D^{+,-}} \max (0, 1-f(q,d^+) + f(q,d^-)). \end{aligned}$$

For a given query q, \(D^{+,-}\) is the set of pairwise preferences from the ground truth: \(d^+\) and \(d^-\) are two documents such that \(d^+\) is more relevant to q than \(d^-\). Conv-KNRM extends K-NRM to model n-gram semantic matches with a convolutional neural network, which can leverage snippet information.

Summary

Representation-based models and interaction-based models extract match features from overall and local aspects, respectively. They can also be combined for further improvements [40].

Recently, large-scale knowledge graphs such as DBpedia, Yago, and Freebase have emerged. Knowledge graphs contain human knowledge about real-world entities and become an opportunity for search systems to understand queries and documents better. The emergence of large-scale knowledge graphs has motivated the development of entity-oriented search, which brings in entities and semantics from the knowledge graphs and has dramatically improved the effectiveness of feature-based search systems.

Entity-oriented search and neural ranking models push the boundary of matching from two different perspectives. Reference [36] incorporates semantics from knowledge graphs into the neural ranking, such as entity descriptions and entity types. This work significantly improves the effectiveness and generalization ability of interaction-based neural ranking models. However, how to fully leverage semi-structured knowledge graphs and establish semantic relevance between queries and documents remains an open question.

Information retrieval is widely used in many natural language processing tasks such as reading comprehension and question answering. Advances in neural information retrieval are therefore likely to benefit these tasks as well.

5.5.2 Question Answering

Question Answering (QA) is one of the most important document-level applications in NLP. Many efforts have been invested in QA, especially in machine reading comprehension and open-domain QA. In this section, we introduce the advances in these two tasks in turn.

Machine Reading Comprehension

As shown in Fig. 5.10, machine reading comprehension aims to determine the answer a to the question q given a passage p. The task could be viewed as a supervised learning problem: given a collection of training examples \(\{(p_i, q_i, a_i)\}_{i=1}^n\), we want to learn a mapping \(f(\cdot )\) that takes the passage \(p_i\) and corresponding question \(q_i\) as inputs and outputs \(\hat{a}_i\), where \(evaluate(\hat{a}_i,a_i)\) is maximized. The evaluation metric is typically correlated with the answer type, which will be discussed in the following.

Fig. 5.10 An example of machine reading comprehension from SQuAD [49]

Generally, the current machine reading comprehension task could be divided into four categories depending on the answer types according to [10], i.e., cloze style, multiple choices, span prediction, and free-form answer.

The cloze style task such as CNN/Daily Mail [24] consists of fill-in-the-blank sentences where the question contains a placeholder to be filled in. The answer a is chosen either from a predefined candidate set A or from the vocabulary V. The multiple-choice task such as RACE [30] and MCTest [50] aims to select the best answer from a set of answer choices. Accuracy is typically used to measure performance on these two tasks, i.e., the percentage of correctly answered questions in the whole example set, since each question is either answered correctly or not from the given candidate answer set.

The span prediction task such as SQuAD [49] is perhaps the most widely adopted task of all, since it strikes a compromise between flexibility and simplicity. The task is to extract the most likely text span from the passage as the answer to the question, which is usually modeled as predicting the start position \(idx_{start}\) and end position \(idx_{end}\) of the answer span. To evaluate the predicted answer span \(\hat{a}\), we typically use two evaluation metrics proposed by [49]. Exact match assigns the full score 1.0 to the predicted answer span \(\hat{a}\) if it exactly equals the ground truth answer a, and 0.0 otherwise. F1 score measures the degree of token-level overlap between \(\hat{a}\) and a by computing the harmonic mean of precision and recall.
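Both metrics are easy to state in code. The sketch below tokenizes by whitespace and treats answers as bags of tokens; the text normalization usually applied before comparison (such as lowercasing and punctuation stripping) is deliberately omitted, so this is a simplified illustration rather than an official scorer.

```python
from collections import Counter

def exact_match(pred, gold):
    """1.0 if the token sequences are identical, else 0.0."""
    return float(pred.split() == gold.split())

def f1_score(pred, gold):
    """Harmonic mean of token-level precision and recall."""
    p, g = pred.split(), gold.split()
    overlap = sum((Counter(p) & Counter(g)).values())  # shared tokens, with multiplicity
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

em = exact_match("the cat sat", "cat sat down")
f1 = f1_score("the cat sat", "cat sat down")
```

In this example two of the three predicted tokens appear in the reference and vice versa, so precision and recall are both 2/3 while exact match is 0.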

The free-form answer task such as MS MARCO [43] does not restrict the answer form or length and is also referred to as generative question answering. It is practical to model the task as a sequence generation problem, where predictions are made token by token. Currently, there is no consensus on the ideal evaluation metric. It is common to adopt standard metrics from machine translation and summarization, including ROUGE [34] and BLEU [57].

As a critical component of question answering systems, neural machine reading comprehension models have greatly advanced the task of question answering in recent years.

Fig. 5.11 The architecture of the bidirectional LSTM reader model

The first attempt [24] to apply neural networks on machine reading comprehension constructs bidirectional LSTM reader models along with attention mechanisms. The work introduces two reader models, i.e., the attentive reader and the impatient reader, as shown in Fig. 5.11. After encoding the passage and the query into hidden states using LSTMs, the attentive reader computes a scalar distribution s(t) over the passage tokens and uses it to compute the weighted sum of the passage hidden states r. The impatient reader extends this idea further by recurrently updating the weighted sum of passage hidden states after it has seen each query token.

The attention mechanisms used in reading comprehension could be viewed as a variant of Memory Networks [64]. Memory Networks use long-term memory units to store information for inference dynamically. Typically, given an input x, the model first converts it into an internal feature representation F(x). Then, the model can update the designated memory units \(m_i\) given the new input: \(m_i=g(m_i, F(x), m)\), or generate output features o given the new input and the memory states: \(o=f(F(x), m)\). Finally, the model converts the output into the response with the desired format: \(r=R(o)\). The key takeaway of Memory Networks is the retaining and updating of some internal memories that capture global information. We will see how this idea is further extended in some sophisticated models.

The application of attention to machine reading comprehension has greatly promoted research in this field. Following [24], the work [11] modifies the way attention is computed and simplifies the prediction layer of the attentive reader. Instead of using \(\tanh (\cdot )\) to compute the relevance between the passage representations \(\{\tilde{\mathbf {p}_i}\}_{i=1}^n\) and the query hidden state \(\mathbf {q}\) (see Eq. 5.33), Chen et al. use a bilinear term to directly capture the passage-query alignment (see Eq. 5.34).

$$\begin{aligned} \alpha _i = {\text {Softmax}}_i({\text {tanh}}(\mathbf {W}_1{\tilde{\mathbf {p}}}_i + \mathbf {W}_2\mathbf {q})), \end{aligned}$$
$$\begin{aligned} \alpha _i ={\text {Softmax}}_i (\mathbf {q}^{\top }\mathbf {W}_3{\tilde{\mathbf {p}}}_i). \end{aligned}$$
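The bilinear variant reduces to two matrix products followed by a softmax. A minimal sketch, with hypothetical names and toy inputs chosen so that the first passage token is clearly the best match; the attended passage vector returned alongside the weights is the weighted sum of passage states described for the attentive reader.

```python
import numpy as np

def bilinear_attention(P, q, W3):
    """P: (n, h) passage token representations; q: (h,) query vector."""
    scores = P @ W3.T @ q                  # entry i equals q^T W3 p_i
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                   # softmax over passage positions
    return alpha, alpha @ P                # weights and attended passage vector

P = np.array([[10.0, 0.0], [0.0, 10.0]])
q = np.array([1.0, 0.0])
alpha, attended = bilinear_attention(P, q, np.eye(2))
```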

Most machine reading comprehension models follow the same paradigm to locate the start and end points of the answer span. As shown in Fig. 5.12, while encoding the passage, the model retains the length of the sequence and encodes the question into a fixed-length hidden representation \(\mathbf {q}\). The question’s hidden vector is then used as a pointer to scan over the passage representation \(\{\mathbf {p}_i\}_{i=1}^n\) and compute scores for every position in the passage. While sharing this architecture, most machine reading comprehension models differ in how the passage and the question interact. In the following, we introduce several classic reading comprehension architectures that follow this paradigm.

Fig. 5.12 The architecture of classic machine reading comprehension models

First, we introduce BiDAF, which is short for Bi-Directional Attention Flow [52]. The BiDAF network consists of the token embedding layer, the contextual embedding layer, the bi-directional attention flow layer, the LSTM modeling layer, and the softmax output layer, as shown in Fig. 5.13.

The token embedding layer consists of two levels. First, the character embedding layer encodes each word at the character level using a 1D convolutional neural network (CNN). Specifically, for each word, characters are embedded into fixed-length vectors, which serve as the 1D input to the CNN. The CNN outputs are then max-pooled over the character positions to obtain a single fixed-length vector per word. Second, the word embedding layer uses pretrained word vectors, namely GloVe [47], to map each word directly into a high-dimensional vector.

Fig. 5.13 The architecture of the BiDAF model

Then the concatenation of the two vectors is fed into a two-layer Highway Network [56]. Equation 5.35 shows one layer of the highway network used in the paper, where \(H_1(\cdot )\) and \(H_2(\cdot )\) represent two affine transformations:

$$\begin{aligned} \mathbf {g}&= {\text {Sigmoid}}(H_1(\mathbf {x})), \end{aligned}$$
$$\begin{aligned} \mathbf {y}&= \mathbf {g} \odot {\text {ReLU}}(H_2(\mathbf {x})) + (1-\mathbf {g}) \odot \mathbf {x}. \end{aligned}$$
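One highway layer can be written directly from the two equations above. In this sketch the affine maps \(H_1\) and \(H_2\) are given explicit weight matrices and biases, whose names and values are illustrative.

```python
import numpy as np

def highway_layer(x, W1, b1, W2, b2):
    """y = g * ReLU(H2(x)) + (1 - g) * x, with gate g = Sigmoid(H1(x))."""
    g = 1.0 / (1.0 + np.exp(-(W1 @ x + b1)))        # transform gate in (0, 1)
    return g * np.maximum(0.0, W2 @ x + b2) + (1.0 - g) * x

rng = np.random.default_rng(0)
d = 4
x = rng.normal(size=d)
W1, W2 = rng.normal(size=(d, d)), rng.normal(size=(d, d))
y = highway_layer(x, W1, np.zeros(d), W2, np.zeros(d))
# A strongly negative gate bias closes the gate, so the layer copies its input.
y_carry = highway_layer(x, W1, np.full(d, -50.0), W2, np.zeros(d))
```

The second call shows the carry behavior that motivates the design: when the gate is closed, the input passes through unchanged.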

After feeding the context and the query to the token embedding layer, we obtain \(\mathbf {X}\in \mathbb {R}^{d\times T}\) for the context and \(\mathbf {Q}\in \mathbb {R}^{d\times J}\) for the query, respectively. Afterward, the contextual embedding layer, which is a bidirectional LSTM, models the temporal interaction between words for both the context and the query.

Next comes the attention flow layer. In this layer, the attention dependency is computed in both directions, i.e., the context-to-query (C2Q) attention and the query-to-context (Q2C) attention. For both kinds of attention, we first compute a similarity matrix \(\mathbf {S}\in \mathbb {R}^{T\times J}\) using the contextual embeddings of the context \(\mathbf {H}\) and the query \(\mathbf {U}\) obtained from the previous layer (Eq. 5.37). In the equation, \(\alpha (\cdot )\) computes the scalar similarity of the given two vectors and \(\mathbf {m}\) is a trainable weight vector.

$$\begin{aligned} \mathbf {S}_{tj}&=\alpha (\mathbf {H}_{:,t}, \mathbf {U}_{:,j}) \end{aligned}$$
$$\begin{aligned} \alpha (\mathbf {h}, \mathbf {u})&= \mathbf {m}^{\top }[\mathbf {h};\mathbf {u};\mathbf {h}\odot \mathbf {u}], \end{aligned}$$

where \(\odot \) indicates element-wise product.

For the C2Q attention, a weighted sum of contextual query embeddings is computed for each context word. The attention distribution over the query for the tth context word is obtained by \(\mathbf {a}_t={\text {Softmax}}(\mathbf {S}_{t,:})\in \mathbb {R}^{J}\). The final attended query vector is therefore \(\tilde{\mathbf {U}}_{:,t}=\sum _ja_{tj}\mathbf {U}_{:,j}\) for each context word.

For the Q2C attention, the context embeddings are merged into a single fixed-length hidden vector \(\tilde{\mathbf {h}}\). The attention distribution over the context words is computed by \(\mathbf {b}={\text {Softmax}}_t(\max _j \mathbf {S}_{tj})\in \mathbb {R}^{T}\), and \(\tilde{\mathbf {h}}=\sum _t b_t\mathbf {H}_{:,t}\). Lastly, \(\tilde{\mathbf {h}}\) is tiled T times along the column dimension to produce \(\tilde{\mathbf {H}}\).

Finally, the attended outputs are combined to yield \(\mathbf {G}\), which is defined by Eq. 5.39:

$$\begin{aligned} \mathbf {G}_{:,t}&= \beta (\mathbf {H}_{:,t}, \tilde{\mathbf {U}}_{:,t}, \tilde{\mathbf {H}}_{:,t}), \end{aligned}$$
$$\begin{aligned} \beta (\mathbf {h}, \tilde{\mathbf {u}}, \tilde{\mathbf {h}})&= [\mathbf {h};\tilde{\mathbf {u}};\mathbf {h}\odot \tilde{\mathbf {u}};\mathbf {h}\odot \tilde{\mathbf {h}}]. \end{aligned}$$
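The whole attention flow layer (similarity matrix, C2Q, Q2C, and the combined output \(\mathbf {G}\)) fits in a short NumPy sketch. The explicit double loop favors readability over speed, D denotes the row dimension of the contextual embeddings, and all sizes and weights are illustrative.

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_flow(H, U, m):
    """H: (D, T) context; U: (D, J) query; m: (3D,) similarity weights."""
    T, J = H.shape[1], U.shape[1]
    S = np.empty((T, J))
    for t in range(T):
        for j in range(J):
            h, u = H[:, t], U[:, j]
            S[t, j] = m @ np.concatenate([h, u, h * u])   # alpha(h, u)
    A = softmax(S, axis=1)                  # C2Q: row t is a_t over query words
    U_tilde = U @ A.T                       # (D, T): column t is sum_j a_tj U_:j
    b = softmax(S.max(axis=1), axis=0)      # Q2C: weights over context words, (T,)
    h_tilde = H @ b                         # (D,)
    H_tilde = np.tile(h_tilde[:, None], (1, T))
    G = np.concatenate([H, U_tilde, H * U_tilde, H * H_tilde], axis=0)  # (4D, T)
    return S, G

rng = np.random.default_rng(0)
D, T, J = 2, 4, 3
S, G = attention_flow(rng.normal(size=(D, T)), rng.normal(size=(D, J)),
                      rng.normal(size=3 * D))
```

Note that \(\mathbf {G}\) keeps one column per context word, so no information is summarized into a single vector before the modeling layer.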

Afterward, the LSTM modeling layer takes \(\mathbf {G}\) as input and encodes it using a two-layer bidirectional LSTM. The output \(\mathbf {M}\in \mathbb {R}^{2d\times T}\) is combined with \(\mathbf {G}\) to yield the final start and end probability distributions over the passage.

$$\begin{aligned} P^1&={\text {Softmax}}(\mathbf {u}_1^{\top }[\mathbf {G};\mathbf {M}]), \end{aligned}$$
$$\begin{aligned} P^2&={\text {Softmax}}(\mathbf {u}_2^{\top }[\mathbf {G};{\text {LSTM}}(\mathbf {M})]), \end{aligned}$$

where \(\mathbf {u}_1, \ \mathbf {u}_2\) are two trainable weight vectors.

To train the model, the negative log likelihood loss is adopted, and the goal is to maximize the probability that the model selects the gold start index \(idx_{start}\) and end index \(idx_{end}\):

$$\begin{aligned} \mathscr {L} = -\frac{1}{N}\sum _{i=1}^N\left( \log (P^1_{idx_{start}^i}) + \log (P^2_{idx_{end}^i}) \right) . \end{aligned}$$
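A small sketch of this training objective, in which the start distribution is scored at the gold start index and the end distribution at the gold end index. The batch layout (one row per example) and the function name are assumptions made for illustration.

```python
import numpy as np

def span_nll(P_start, P_end, starts, ends):
    """P_start, P_end: (N, T) distributions; starts, ends: (N,) gold indices."""
    idx = np.arange(len(starts))
    log_lik = np.log(P_start[idx, starts]) + np.log(P_end[idx, ends])
    return float(-np.mean(log_lik))

P_start = np.array([[0.5, 0.5]])
P_end = np.array([[0.25, 0.75]])
loss = span_nll(P_start, P_end, np.array([0]), np.array([1]))
```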

Besides BiDAF, where attention dependencies are computed in two directions, we also briefly introduce other interaction methods between the query and the passage. The Gated-Attention Reader proposed by [19] adopts the gated attention module, where each passage token representation \(\mathbf {d}_i\) is scaled by its attended query vector \(\tilde{\mathbf {q}}_i\), computed from the query representation \(\mathbf {Q}\), after each Bi-GRU layer (Eq. 5.44).

$$\begin{aligned} \alpha _i&={\text {Softmax}}(\mathbf {Q}^\top \mathbf {d}_i) \end{aligned}$$
$$\begin{aligned} \tilde{\mathbf {q}}_i&=\mathbf {Q}\alpha _i \end{aligned}$$
$$\begin{aligned} \mathbf {x}_i&= \mathbf {d}_i \odot \tilde{\mathbf {q}}_i. \end{aligned}$$
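The gated attention module can be vectorized over all passage tokens. In this sketch the query token states are stored as the columns of `Q`, the passage states as the rows of `D`, and the toy data are illustrative.

```python
import numpy as np

def gated_attention(D, Q):
    """D: (n, h) passage token states; Q: (h, m) query token states as columns."""
    scores = D @ Q                                   # row i holds Q^T d_i, shape (n, m)
    scores = scores - scores.max(axis=1, keepdims=True)
    alpha = np.exp(scores)
    alpha /= alpha.sum(axis=1, keepdims=True)        # alpha_i over query tokens
    Q_tilde = alpha @ Q.T                            # row i is Q alpha_i, shape (n, h)
    return D * Q_tilde                               # x_i = d_i * q~_i (element-wise)

rng = np.random.default_rng(0)
D = rng.normal(size=(5, 3))
X = gated_attention(D, rng.normal(size=(3, 4)))
# With a single query token the attention is trivially 1, so each d_i is gated by that token.
Q_single = rng.normal(size=(3, 1))
X_single = gated_attention(D, Q_single)
```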

This gated attention mechanism allows the query to directly interact with the token embeddings of the passage at the semantic level. And such layer-wise interaction enables the model to learn conditional token representation given the question at different representation levels.

The Attention-over-Attention Reader [16] takes another path to model the interaction. The attention-over-attention mechanism involves calculating the attention between the passage attention \(\alpha (t)\) and the averaged question attention \(\beta \) after obtaining the similarity matrix \(\mathbf {M}\in \mathbb {R}^{n\times m}\) (Eq. 5.47). This operation is considered to learn the contributions of individual question words explicitly.

$$\begin{aligned} \alpha (t) = {\text {Softmax}}(\mathbf {M}_{:,t}), \\ \beta = \frac{1}{n}\sum _{t=1}^n {\text {Softmax}}(\mathbf {M}_{t,:}). \end{aligned}\tag{5.47}$$

Open-Domain Question Answering

Open-domain QA (OpenQA) was first proposed by [21]. The task aims to answer open-domain questions using external resources such as collections of documents [58], web pages [14, 29], structured knowledge graphs [3, 7], or automatically extracted relational triples [20].

Fig. 5.14 An example of open-domain question answering

Recently, with the development of machine reading comprehension techniques [11, 16, 19, 55, 63], researchers have attempted to answer open-domain questions by performing reading comprehension on plain texts. Reference [12] proposes DrQA, which employs neural-based models to answer open-domain questions. As illustrated in Fig. 5.14, a neural-based OpenQA system usually retrieves texts relevant to the question from a large-scale corpus and then extracts answers from these texts using reading comprehension models.

The DrQA system consists of two components: (1) The document retriever module for finding relevant articles and (2) the document reader model for extracting answers from given contexts.

The document retriever serves as a first quick skim to narrow the search space and focus on documents that are likely to be relevant. The retriever builds TF-IDF weighted bag-of-words vectors for the documents and the questions, and computes similarity scores for ranking. To further utilize local word order information, the retriever uses hashed bigram counts, which preserves both speed and memory efficiency.
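The retrieval step can be illustrated with a small standard-library sketch. Several choices here are assumptions made for illustration and are not taken from the DrQA system: `zlib.crc32` as the hash function, the bucket count, the smoothed IDF formula, and the unnormalized dot-product score.

```python
import math
import zlib
from collections import Counter

NUM_BUCKETS = 2 ** 20   # assumption: size of the hash space

def hashed_ngrams(text):
    """Counts of hashed unigrams and bigrams for a whitespace-tokenized text."""
    tokens = text.lower().split()
    grams = tokens + [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return Counter(zlib.crc32(g.encode("utf-8")) % NUM_BUCKETS for g in grams)

def build_index(docs):
    counts = [hashed_ngrams(d) for d in docs]
    df = Counter(bucket for c in counts for bucket in c)   # document frequencies
    idf = {b: math.log((len(docs) + 1) / (n + 0.5)) for b, n in df.items()}
    return counts, idf

def rank(question, counts, idf):
    """Document indices sorted by the dot product of TF-IDF vectors."""
    q = hashed_ngrams(question)
    def score(doc):
        return sum(q[b] * doc[b] * idf.get(b, 0.0) ** 2 for b in q)
    return sorted(range(len(counts)), key=lambda i: score(counts[i]), reverse=True)

docs = ["steve jobs founded apple", "the cat sat on the mat", "apple pie recipe"]
counts, idf = build_index(docs)
order = rank("who founded apple", counts, idf)
```

Hashing keeps the feature space at a fixed size regardless of how many distinct bigrams the corpus contains, at the cost of occasional collisions.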

The document reader model takes in the top 5 Wikipedia articles yielded by the document retriever and extracts the final answer to the question. For each article, the document reader predicts an answer span with a confidence score. The final prediction is made by maximizing the unnormalized exponential of prediction scores across the documents.

Given each document d, the document reader first builds a feature representation \(\tilde{\mathbf {d}}_i\) for each word \(d_i\) in the document. The feature representation is made up of the following components.

  1. Word embeddings: The word embeddings \(f_{emb}(d_i)\) are obtained from large-scale GloVe embeddings pretrained on Wikipedia.

  2. Manual features: The manual features \(f_{token}(d_i)\) combine part-of-speech (POS) tags, named entity recognition tags, and normalized term frequencies (TF).

  3. Exact match: This feature \(f_{exact\_match}(d_i)\) indicates whether \(d_i\) can be exactly matched to one question word in q.

  4. Aligned question embeddings: This feature aims to encode a soft alignment between words in the document and the question in the word embedding space.

    $$\begin{aligned} f_{align}(d_i) = \sum _j\alpha _{ij}\mathbf {E}(q_j), \end{aligned}$$
    $$\begin{aligned} \alpha _{ij}=\frac{\exp ({\text {MLP}}(\mathbf {E}(d_i))^\top {\text {MLP}}(\mathbf {E}(q_j)))}{\sum _{j'}\exp ({\text {MLP}}(\mathbf {E}(d_i))^\top {\text {MLP}}(\mathbf {E}(q_{j'})))}, \end{aligned}$$

    where \({\text {MLP}}(\mathbf {x})=\max (0, \mathbf {W}\mathbf {x}+\mathbf {b})\) and \(\mathbf {E}(q_j)\) indicates the word embedding of the jth word in the question.
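The aligned question embedding feature can be sketched as follows, assuming the word embeddings are given as matrices and that the same single ReLU layer is applied to document and question words, as in the formula above. Names and toy sizes are illustrative.

```python
import numpy as np

def aligned_question_embedding(E_d, E_q, W, b):
    """E_d: (n, e) document word embeddings; E_q: (m, e) question word embeddings."""
    def mlp(E):
        return np.maximum(0.0, E @ W.T + b)          # MLP(x) = max(0, Wx + b), row-wise
    scores = mlp(E_d) @ mlp(E_q).T                   # (n, m)
    scores = scores - scores.max(axis=1, keepdims=True)
    alpha = np.exp(scores)
    alpha /= alpha.sum(axis=1, keepdims=True)        # softmax over question words
    return alpha @ E_q                               # row i is f_align(d_i)

rng = np.random.default_rng(0)
n, m, e, k = 6, 3, 4, 5
W, b = rng.normal(size=(k, e)), np.zeros(k)
E_d = rng.normal(size=(n, e))
F = aligned_question_embedding(E_d, rng.normal(size=(m, e)), W, b)
# With a one-word question, every document word aligns fully to that word.
E_q1 = rng.normal(size=(1, e))
F1 = aligned_question_embedding(E_d, E_q1, W, b)
```

Unlike the exact match feature, this soft alignment can relate words that are similar but not identical.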

Finally, the feature representation is obtained by concatenating the above features:

$$\begin{aligned} \tilde{\mathbf {d}}_i=(f_{emb}(d_i), f_{token}(d_i), f_{exact\_match}(d_i), f_{align}(d_i)). \end{aligned}$$

Then the feature representation of the document is fed into a multilayer bidirectional LSTM (BiLSTM) to encode the contextual representation.

$$\begin{aligned} \mathbf {d}_1,\ldots ,\mathbf {d}_n={\text {BiLSTM}}(\tilde{\mathbf {d}}_1,\ldots ,\tilde{\mathbf {d}}_n). \end{aligned}$$

For the question, the contextual representation is simply obtained by encoding the word embeddings using a multilayer BiLSTM.

$$\begin{aligned} \mathbf {q}_1,\ldots ,\mathbf {q}_m={\text {BiLSTM}}(\tilde{\mathbf {q}}_1,\ldots ,\tilde{\mathbf {q}}_m) \end{aligned}$$

After that, the contextual representation is aggregated into a fixed-length vector using self-attention.

$$\begin{aligned} b_j&=\frac{\exp (\mathbf {u}^{\top }\mathbf {q}_j)}{\sum _{j'}\exp (\mathbf {u}^{\top }\mathbf {q}_{j'})}\end{aligned}$$
$$\begin{aligned} \mathbf {q}&= \sum _j b_j\mathbf {q}_j. \end{aligned}$$

In the answer prediction phase, the start and end probability distributions are calculated following the paradigm described above for machine reading comprehension models:

$$\begin{aligned} P^{start}(i)&= \frac{\exp (\mathbf {d}_i^{\top }\mathbf {W}^{start}\mathbf {q})}{\sum _{i'}\exp (\mathbf {d}_{i'}^{\top }\mathbf {W}^{start}\mathbf {q})} \end{aligned}$$
$$\begin{aligned} P^{end}(i)&= \frac{\exp (\mathbf {d}_i^{\top }\mathbf {W}^{end}\mathbf {q})}{\sum _{i'}\exp (\mathbf {d}_{i'}^{\top }\mathbf {W}^{end}\mathbf {q})}. \end{aligned}$$
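Question pooling and the two bilinear span scorers can be sketched together. The final span search, including the `max_len` cap on span length, is an assumed decoding strategy added for illustration; the encoded states and weights are random placeholders.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def predict_span(D, Q, u, W_start, W_end, max_len=15):
    """D: (n, h) passage states; Q: (m, h) question states. Returns (start, end)."""
    q = softmax(Q @ u) @ Q                    # q = sum_j b_j q_j
    p_start = softmax(D @ W_start @ q)        # P^start(i) from d_i^T W^start q
    p_end = softmax(D @ W_end @ q)            # P^end(i) from d_i^T W^end q
    # Assumed decoding: best (i, j) with i <= j < i + max_len by P^start(i) * P^end(j)
    best = max(
        ((p_start[i] * p_end[j], i, j)
         for i in range(len(D))
         for j in range(i, min(i + max_len, len(D)))),
        key=lambda s: s[0],
    )
    return best[1], best[2]

rng = np.random.default_rng(0)
n, m, h = 20, 4, 3
start, end = predict_span(rng.normal(size=(n, h)), rng.normal(size=(m, h)),
                          rng.normal(size=h), rng.normal(size=(h, h)),
                          rng.normal(size=(h, h)))
```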

Despite its success, the DrQA system is prone to noise in the retrieved texts, which may hurt its performance. Hence, [15] and [61] attempt to solve the noise problem in DrQA by separating question answering into paragraph selection and answer extraction; both select only the most relevant paragraph among all retrieved paragraphs to extract answers, and thus lose a large amount of rich information contained in the neglected paragraphs. To address this, [62] proposes strength-based and coverage-based re-ranking approaches, which aggregate the results extracted from each paragraph by an existing distantly supervised QA (DS-QA) system to better determine the answer. However, the method relies on the pre-extracted answers of existing DS-QA models and still suffers from the noise issue in distant supervision data because it considers all retrieved paragraphs indiscriminately. To address this issue, [35] proposes a coarse-to-fine denoising OpenQA model, which employs a paragraph selector to filter out noisy paragraphs and a paragraph reader to extract the correct answer from the denoised paragraphs.

5.6 Summary

In this chapter, we have introduced document representation learning, which encodes the semantic information of the whole document into a real-valued representation vector, providing an effective way of downstream tasks utilizing the document information and has significantly improved the performances of these tasks.

First, we introduce the one-hot representation for documents. Next, we extensively introduce topic models, which represent both words and documents using latent topic distributions. Further, we give an introduction to distributed document representation, including paragraph vector and neural document representations. Finally, we introduce several typical real-world applications of document representation, including information retrieval and question answering.

Looking ahead, several directions require further effort to achieve better document representation:

  1. Incorporating External Knowledge. Current document representation approaches focus on representing documents with the semantic information of the whole document text. Meanwhile, knowledge bases provide external semantic information that helps to better understand the real-world entities in a given document. Researchers have formed a consensus that incorporating entity semantics from knowledge bases into document representation is a promising way toward better document representation. Some existing work leverages various entity semantics to enrich document representations and achieves better performance in applications such as document ranking [36, 65]. Explicitly modeling structural and textual semantic information, as well as considering entity importance for the given document, also sheds some light on more interpretable and knowledgeable document representations for downstream NLP tasks.

  2. Considering Document Interactions. The candidate documents in downstream NLP tasks are usually relevant to each other, and this relatedness may help to better model document semantics. Interactions among documents, whether implicit semantic relations or explicit links, can provide additional semantic signals to enhance document representations. Reference [32] makes a preliminary attempt to use document interactions to extract important words and improve model performance. Nevertheless, how to effectively and explicitly incorporate semantic information from other documents into document representations remains an unsolved problem.

  3. Pretraining for Document Representation. Pretraining has proven effective on downstream NLP tasks. Existing pretrained models such as Word2vec-style word co-occurrence models [38] and BERT-style masked language models [18, 48] focus on representation learning at the sentence level, which does not work well for document-level representation. It is still challenging to model cross-sentence relations, text coherence, and co-reference at the document level. Moreover, some methods leverage useful signals such as anchor-document information to supervise document representation learning [67]. How to pretrain document representation models with efficient and effective strategies remains a critical and challenging problem.


References

  1. Amr Ahmed, Mohamed Aly, Joseph Gonzalez, Shravan Narayanamurthy, and Alexander Smola. Scalable inference in latent variable models. In Proceedings of WSDM, 2012.

  2. Arthur Asuncion, Max Welling, Padhraic Smyth, and Yee Whye Teh. On smoothing and inference for topic models. In Proceedings of UAI, 2009.

  3. Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. Semantic parsing on freebase from question-answer pairs. In Proceedings of EMNLP, 2013.

  4. David M Blei, Thomas L Griffiths, and Michael I Jordan. The nested chinese restaurant process and bayesian nonparametric inference of topic hierarchies. The Journal of the ACM, 57(2):7, 2010.

  5. David M Blei and John D Lafferty. Dynamic topic models. In Proceedings of ICML, 2006.

  6. David M Blei, Andrew Y Ng, and Michael I Jordan. Latent dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.

  7. Antoine Bordes, Nicolas Usunier, Sumit Chopra, and Jason Weston. Large-scale simple question answering with memory networks. arXiv preprint arXiv:1506.02075, 2015.

  8. Jordan L Boyd-Graber and David M Blei. Syntactic topic models. In Proceedings of NeurIPS, 2009.

  9. Jonathan Chang and David M Blei. Hierarchical relational models for document networks. The Annals of Applied Statistics, pages 124–150, 2010.

  10. Danqi Chen. Neural Reading Comprehension and Beyond. PhD thesis, Stanford University, 2018.

  11. Danqi Chen, Jason Bolton, and Christopher D. Manning. A thorough examination of the cnn/daily mail reading comprehension task. In Proceedings of ACL, 2016.

  12. Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. Reading wikipedia to answer open-domain questions. In Proceedings of the ACL, 2017.

  13. Jianfei Chen, Kaiwei Li, Jun Zhu, and Wenguang Chen. WarpLDA: a cache efficient O(1) algorithm for latent Dirichlet allocation. In Proceedings of VLDB, 2016.

  14. Tongfei Chen and Benjamin Van Durme. Discriminative information retrieval for question answering sentence selection. In Proceedings of EACL, 2017.

  15. Eunsol Choi, Daniel Hewlett, Jakob Uszkoreit, Illia Polosukhin, Alexandre Lacoste, and Jonathan Berant. Coarse-to-fine question answering for long documents. In Proceedings of ACL, 2017.

  16. Yiming Cui, Zhipeng Chen, Si Wei, Shijin Wang, Ting Liu, and Guoping Hu. Attention-over-attention neural networks for reading comprehension. In Proceedings of ACL, 2017.

  17. Zhuyun Dai, Chenyan Xiong, Jamie Callan, and Zhiyuan Liu. Convolutional neural networks for soft-matching n-grams in ad-hoc search. In Proceedings of WSDM, 2018.

  18. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL, 2019.

  19. Bhuwan Dhingra, Hanxiao Liu, Zhilin Yang, William Cohen, and Ruslan Salakhutdinov. Gated-attention readers for text comprehension. In Proceedings of ACL, 2017.

  20. Anthony Fader, Luke Zettlemoyer, and Oren Etzioni. Open question answering over curated and extracted knowledge bases. In Proceedings of SIGKDD, 2014.

  21. Bert F Green Jr, Alice K Wolf, Carol Chomsky, and Kenneth Laughery. Baseball: an automatic question-answerer. In Proceedings of IRE-AIEE-ACM, 1961.

  22. Thomas L Griffiths, Mark Steyvers, David M Blei, and Joshua B Tenenbaum. Integrating topics and syntax. In Proceedings of NeurIPS, 2004.

  23. Jiafeng Guo, Yixing Fan, Qingyao Ai, and W.Bruce Croft. A deep relevance matching model for ad-hoc retrieval. In Proceedings of CIKM, 2016.

  24. Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. Teaching machines to read and comprehend. In Proceedings of NeurIPS, 2015.

  25. Baotian Hu, Zhengdong Lu, Hang Li, and Qingcai Chen. Convolutional neural network architectures for matching natural language sentences. In Proceedings of NeurIPS, 2014.

  26. Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. Learning deep structured semantic models for web search using clickthrough data. In Proceedings of CIKM, 2013.

  27. Kai Hui, Andrew Yates, Klaus Berberich, and Gerard de Melo. Pacrr: A position-aware neural ir model for relevance matching. In Proceedings of EMNLP, 2017.

  28. Yangfeng Ji, Trevor Cohn, Lingpeng Kong, Chris Dyer, and Jacob Eisenstein. Document context language models. arXiv preprint arXiv:1511.03962, 2015.

  29. Cody Kwok, Oren Etzioni, and Daniel S Weld. Scaling question answering to the web. TOIS, pages 242–262, 2001.

  30. Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. RACE: Large-scale reading comprehension dataset from examinations. arXiv preprint arXiv:1704.04683, 2017.

  31. Quoc V Le and Tomas Mikolov. Distributed representations of sentences and documents. In Proceedings of ICML, 2014.

  32. Canjia Li, Yingfei Sun, Ben He, Le Wang, Kai Hui, Andrew Yates, Le Sun, and Jungang Xu. Nprf: A neural pseudo relevance feedback framework for ad-hoc information retrieval. In Proceedings of EMNLP, 2018.

  33. Jiwei Li, Thang Luong, and Dan Jurafsky. A hierarchical neural autoencoder for paragraphs and documents. In Proceedings of ACL, 2015.

  34. Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. Text Summarization Branches Out, 2004.

  35. Yankai Lin, Haozhe Ji, Zhiyuan Liu, and Maosong Sun. Denoising distantly supervised open-domain question answering. In Proceedings of ACL, 2018.

  36. Zhenghao Liu, Chenyan Xiong, Maosong Sun, and Zhiyuan Liu. Entity-duet neural ranking: Understanding the role of knowledge graph semantics in neural information retrieval. In Proceedings of ACL, 2018.

  37. Jon D Mcauliffe and David M Blei. Supervised topic models. In Proceedings of NeurIPS, 2008.

  38. T Mikolov and J Dean. Distributed representations of words and phrases and their compositionality. In Proceedings of NeurIPS, 2013.

  39. David Mimno and Andrew McCallum. Topic models conditioned on arbitrary features with dirichlet-multinomial regression. In Proceedings of UAI, 2008.

  40. Bhaskar Mitra, Fernando Diaz, and Nick Craswell. Learning to match using local and distributed representations of text for web search. In Proceedings of WWW, 2017.

  41. David Newman, Arthur U Asuncion, Padhraic Smyth, and Max Welling. Distributed inference for latent dirichlet allocation. In Proceedings of NeurIPS, 2007.

  42. David Newman, Chaitanya Chemudugunta, and Padhraic Smyth. Statistical entity-topic models. In Proceedings of SIGKDD, 2006.

  43. Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. MS MARCO: A human generated machine reading comprehension dataset. arXiv preprint arXiv:1611.09268, 2016.

  44. Hamid Palangi, Li Deng, Yelong Shen, Jianfeng Gao, Xiaodong He, Jianshu Chen, Xinying Song, and Rabab Ward. Deep sentence embedding using long short-term memory networks: Analysis and application to information retrieval. IEEE/ACM Transactions on Audio, Speech and Language Processing, 24(4):694–707, 2016.

  45. Liang Pang, Yanyan Lan, Jiafeng Guo, Jun Xu, Shengxian Wan, and Xueqi Cheng. Text matching as image recognition. In Proceedings of AAAI, 2016.

  46. Liang Pang, Yanyan Lan, Jiafeng Guo, Jun Xu, Jingfang Xu, and Xueqi Cheng. Deeprank: A new deep architecture for relevance ranking in information retrieval. In Proceedings of CIKM, 2017.

  47. Jeffrey Pennington, Richard Socher, and Christopher Manning. Glove: Global vectors for word representation. In Proceedings of EMNLP, 2014.

  48. Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In Proceedings of NAACL-HLT, pages 2227–2237, 2018.

  49. Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. Squad: 100,000+ questions for machine comprehension of text. In Proceedings of EMNLP, 2016.

  50. Matthew Richardson, Christopher JC Burges, and Erin Renshaw. MCTest: A challenge dataset for the open-domain machine comprehension of text. In Proceedings of EMNLP, 2013.

  51. Michal Rosen-Zvi, Thomas Griffiths, Mark Steyvers, and Padhraic Smyth. The author-topic model for authors and documents. In Proceedings of UAI, 2004.

  52. Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. Bidirectional attention flow for machine comprehension. In Proceedings of ICLR, 2017.

  53. Aliaksei Severyn and Alessandro Moschitti. Learning to rank short text pairs with convolutional deep neural networks. In Proceedings of SIGIR, 2015.

  54. Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Grégoire Mesnil. A latent semantic model with convolutional-pooling structure for information retrieval. In Proceedings of CIKM, 2014.

  55. Yelong Shen, Po-Sen Huang, Jianfeng Gao, and Weizhu Chen. Reasonet: Learning to stop reading in machine comprehension. In Proceedings of SIGKDD, 2017.

  56. Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. Highway networks. arXiv preprint arXiv:1505.00387, 2015.

  57. A Cuneyd Tantug, Kemal Oflazer, and Ilknur Durgar El-Kahlout. Bleu+: a tool for fine-grained bleu computation. 2008.

  58. Ellen M Voorhees et al. The trec-8 question answering track report. In Proceedings of TREC, 1999.

  59. Hanna M Wallach. Topic modeling: beyond bag-of-words. In Proceedings of ICML, 2006.

  60. Chong Wang, Bo Thiesson, Chris Meek, and David Blei. Markov topic models. In Proceedings of AISTATS, 2009.

  61. Shuohang Wang, Mo Yu, Xiaoxiao Guo, Zhiguo Wang, Tim Klinger, Wei Zhang, Shiyu Chang, Gerald Tesauro, Bowen Zhou, and Jing Jiang. R3: Reinforced ranker-reader for open-domain question answering. In Proceedings of AAAI, 2018.

  62. Shuohang Wang, Mo Yu, Jing Jiang, Wei Zhang, Xiaoxiao Guo, Shiyu Chang, Zhiguo Wang, Tim Klinger, Gerald Tesauro, and Murray Campbell. Evidence aggregation for answer re-ranking in open-domain question answering. In Proceedings of ICLR, 2018.

  63. Wenhui Wang, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. Gated self-matching networks for reading comprehension and question answering. In Proceedings of ACL, 2017.

  64. Jason Weston, Sumit Chopra, and Antoine Bordes. Memory networks. arXiv preprint arXiv:1410.3916, 2014.

  65. Chenyan Xiong, Jamie Callan, and Tie-Yan Liu. Word-entity duet representations for document ranking. In Proceedings of SIGIR, 2017.

  66. Chenyan Xiong, Zhuyun Dai, Jamie Callan, Zhiyuan Liu, and Russell Power. End-to-end neural ad-hoc ranking with kernel pooling. In Proceedings of SIGIR, 2017.

  67. Kaitao Zhang, Chenyan Xiong, Zhenghao Liu, and Zhiyuan Liu. Selective weak supervision for neural information retrieval. arXiv preprint arXiv:2001.10382, 2020.

Author information

Corresponding author

Correspondence to Zhiyuan Liu.

Rights and permissions

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

Copyright information

© 2020 The Author(s)

About this chapter

Cite this chapter

Liu, Z., Lin, Y., Sun, M. (2020). Document Representation. In: Representation Learning for Natural Language Processing. Springer, Singapore.

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-15-5572-5

  • Online ISBN: 978-981-15-5573-2

  • eBook Packages: Computer Science (R0)