4.1 Introduction

Natural language sentences consist of words or phrases, follow grammatical rules, and convey complete semantic information. Compared with words and phrases, sentences have more complex structures, both sequential and hierarchical, which are essential for understanding them. In NLP, how sentences are represented is critical for downstream applications such as sentence classification, sentiment analysis, and sentence matching.

Before deep learning took off, sentences were usually represented as one-hot or TF-IDF vectors under the bag-of-words assumption. In this case, a sentence is represented as a vocabulary-sized vector in which each element indicates the importance of a specific word (either its term frequency or its TF-IDF value) to the sentence. However, this method faces two issues. First, the dimension of such representation vectors usually reaches thousands or millions, so they suffer from sparsity and computational inefficiency. Second, such a representation follows the bag-of-words assumption and ignores sequential and structural information, which can be crucial for understanding the semantic meanings of sentences.

Inspired by recent advances of deep learning in computer vision and speech, researchers proposed to model sentences with deep neural networks such as convolutional and recurrent neural networks. Compared with conventional word frequency-based sentence representations, deep neural networks can capture the internal structures of sentences, e.g., sequential and dependency information, through convolutional or recurrent operations. Thus, neural network-based sentence representations have achieved great success in sentence modeling and NLP tasks.

4.2 One-Hot Sentence Representation

One-hot representation is the simplest and most straightforward method for representing words. It represents each word with a fixed-length binary vector. Specifically, for a vocabulary \(V = \{w_1, w_2, \ldots , w_{|V|}\}\), the one-hot representation of word \(w_i\) is \( \mathbf {w}_i = [0, \ldots , 0, 1, 0, \ldots , 0]\), with a 1 in the ith position. Based on one-hot word representations and the vocabulary, a sentence \( s = \{ w_1, w_2, \ldots , w_l \} \) can be represented as

$$\begin{aligned} \mathbf {s} = \sum _{k=1}^{l}\mathbf {w}_{k}, \end{aligned}$$
(4.1)

where l indicates the length of the sentence s. The sentence representation \(\mathbf {s}\) is the sum of the one-hot representations of the l words within the sentence, i.e., each element of \(\mathbf {s}\) is the Term Frequency (TF) of the corresponding word.

Moreover, researchers usually take the importance of different words into consideration rather than treating all words equally. For example, function words such as “a”, “an”, and “the” appear in most sentences and carry little meaning. Therefore, the Inverse Document Frequency (IDF) is employed to measure the importance of \(w_i\) in V as follows:

$$\begin{aligned} \text {idf}_{w_i} = \log \frac{|D|}{\text {df}_{w_i}}, \end{aligned}$$
(4.2)

where |D| is the number of all documents in the corpus D and \(\text {df}_{w_i}\) represents the Document Frequency (DF) of \(w_i\).

With the importance of each word, the sentences are represented more precisely as follows:

$$\begin{aligned} \hat{\mathbf {s}} = \mathbf {s}\otimes \text {idf}, \end{aligned}$$
(4.3)

where \(\otimes \) is the element-wise product and \(\text {idf} = [\text {idf}_{w_1}, \ldots , \text {idf}_{w_{|V|}}]\) is the vector of IDF values.

Here, \(\hat{\mathbf {s}}\) is the TF-IDF representation of the sentence s.
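As a concrete illustration of Eqs. (4.1)-(4.3), the following pure-Python sketch builds TF and TF-IDF sentence vectors over a toy corpus; the corpus and all variable names are ours.

```python
import math
from collections import Counter

corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "chased", "the", "cat"],
    ["a", "bird", "flew", "over", "the", "house"],
]
vocab = sorted({w for sent in corpus for w in sent})
word2id = {w: i for i, w in enumerate(vocab)}

def tf_vector(sentence):
    # Sum of one-hot vectors: element i holds the term frequency of word i (Eq. 4.1).
    vec = [0.0] * len(vocab)
    for w in sentence:
        vec[word2id[w]] += 1.0
    return vec

# Document frequency and IDF over the corpus (Eq. 4.2).
df = Counter(w for sent in corpus for w in set(sent))
idf = [math.log(len(corpus) / df[w]) for w in vocab]

def tfidf_vector(sentence):
    # Element-wise product of the TF vector and the IDF vector (Eq. 4.3).
    return [t * i for t, i in zip(tf_vector(sentence), idf)]

print(tfidf_vector(corpus[0]))
```

Note how the frequent word "the" receives an IDF of zero and thus contributes nothing to the TF-IDF vector, matching the motivation for down-weighting function words.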

4.3 Probabilistic Language Model

One-hot sentence representation neglects the structural information in a sentence. To address this issue, researchers proposed the probabilistic language model, which treats n-grams rather than single words as the basic components. An n-gram is a subsequence of words within a context window of length n, and a probabilistic language model defines the probability of a sentence \(s=[w_1, w_2, \ldots , w_l]\) as

$$\begin{aligned} P(s) = \prod _{i=1}^{l}P(w_i| {w}_{1}^{i-1}). \end{aligned}$$
(4.4)

Actually, the model in Eq. (4.4) is not practical due to its enormous parameter space. In practice, we simplify the model by setting an n-sized context window, assuming that the probability of word \(w_{i} \) only depends on \([w_{i - n + 1}, \ldots, w_{i-1}]\). More specifically, an n-gram language model predicts the word \(w_i\) in the sentence s based on its previous \(n-1\) words. Therefore, the simplified probability of a sentence is formalized as

$$\begin{aligned} P(s) = \prod _{i=1}^{l}P(w_i| {w}_{i-n+1}^{i-1}), \end{aligned}$$
(4.5)

where the conditional probability of the word \(w_i\) can be calculated from n-gram frequency counts:

$$\begin{aligned} P(w_i | {w}_{i-n+1}^{i-1}) = \frac{P( {w}_{i-n+1}^{i})}{P( {w}_{i-n+1}^{i-1})}. \end{aligned}$$
(4.6)

Typically, the conditional probabilities in n-gram language models are not calculated directly from the frequency counts, since this suffers severe problems when confronted with n-grams that have not been seen before. Therefore, researchers proposed several types of smoothing approaches, which assign some of the total probability mass to unseen words or n-grams, such as “add-one” smoothing, Good-Turing discounting, and back-off models.
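As a concrete illustration, the following pure-Python sketch estimates a bigram (n = 2) language model with add-one smoothing from counts, following Eqs. (4.5) and (4.6); the toy corpus and all names are ours.

```python
import math
from collections import Counter

sentences = [["<s>", "the", "cat", "sat", "</s>"],
             ["<s>", "the", "dog", "sat", "</s>"]]

unigram = Counter(w for s in sentences for w in s)
bigram = Counter((s[i - 1], s[i]) for s in sentences for i in range(1, len(s)))
V = len(unigram)  # vocabulary size used by add-one smoothing

def cond_prob(word, prev):
    # P(w_i | w_{i-1}) with add-one ("Laplace") smoothing over the counts of Eq. (4.6).
    return (bigram[(prev, word)] + 1) / (unigram[prev] + V)

def sentence_log_prob(sentence):
    # log P(s) = sum_i log P(w_i | w_{i-1})   (Eq. 4.5 with n = 2)
    return sum(math.log(cond_prob(sentence[i], sentence[i - 1]))
               for i in range(1, len(sentence)))

print(sentence_log_prob(["<s>", "the", "cat", "sat", "</s>"]))
```

Thanks to smoothing, an unseen bigram such as ("cat", "chased") still receives a small nonzero probability instead of making the whole sentence probability zero.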

The n-gram model is a typical probabilistic language model for predicting the next word in a sequence, which follows the Markov assumption that the probability of the target word only relies on the previous \(n-1\) words. This idea is employed by most current sentence modeling methods. The n-gram language model is used as an approximation of the true underlying language model. This assumption is crucial because it massively simplifies the problem of learning the parameters of language models from data. Recent works on word representation learning [3, 40, 43] are mainly based on the n-gram language model.

4.4 Neural Language Model

Although smoothing approaches can alleviate the sparsity problem in probabilistic language models, they still perform poorly on unseen or uncommon words and n-grams. Moreover, as probabilistic language models are constructed on larger and larger texts, the number of unique words (the vocabulary) grows, and the number of possible word sequences increases exponentially with the vocabulary size. This causes a severe data sparsity problem: enormous amounts of data would be required to estimate the probabilities accurately.

To address this issue, researchers proposed neural language models, which use continuous representations (embeddings) of words and neural networks to make their predictions. Embeddings in the continuous space help to alleviate the curse of dimensionality in language modeling, and neural networks avoid the sparsity problem by representing words in a distributed way, as nonlinear combinations of weights in a neural network [2]. An alternative view is that a neural network approximates the language function. The network architecture may be feedforward or recurrent; the former is simpler while the latter is more common.

Similar to probabilistic language models, neural language models are constructed and trained as probabilistic classifiers that learn to predict a probability distribution:

$$\begin{aligned} P(s) = \prod _{i=1}^{l}P(w_i|\mathbf {w}_{1}^{i-1}), \end{aligned}$$
(4.7)

where the conditional probability of the word \(w_i\) can be calculated by various kinds of neural networks such as feedforward neural networks, recurrent neural networks, and so on. In the following sections, we will introduce these neural language models in detail.

4.4.1 Feedforward Neural Network Language Model

The goal of a neural network language model is to estimate the conditional probability \(P(w_i|w_1, \ldots , w_{i-1})\). However, the feedforward neural network (FNN) lacks an effective way to represent long-term historical context. Therefore, it adopts the idea of n-gram language models to approximate the conditional probability, assuming that each word in a sequence depends statistically more on the words closer to it, and only \(n-1\) context words are used to calculate the conditional probability, i.e., \(P(w_i|\mathbf {w}_{1}^{i-1})\approx P(w_i|\mathbf {w}_{i-n+1}^{i-1})\).

The overall architecture of the FNN language model was proposed by [3]. To evaluate the conditional probability of the word \(w_i\), it first projects the \(n-1\) context words to their word vector representations \(\mathbf {x} = [\mathbf {w}_{i-n+1}, \ldots , \mathbf {w}_{i-1}]\), and then feeds them into an FNN, which can be generally represented as

$$\begin{aligned} \mathbf {y}=\mathbf {M} f(\mathbf {W}\mathbf {x}+\mathbf {b}) +\mathbf {d}, \end{aligned}$$
(4.8)

where \(\mathbf {W}\) is a weight matrix that transforms word vectors to hidden representations, \(\mathbf {M}\) is a weight matrix connecting the hidden layer and the output layer, and \(\mathbf {b}, \mathbf {d}\) are bias vectors. The conditional probability of the word \(w_i\) can then be calculated as

$$\begin{aligned} P(w_i|\mathbf {w}_{i-n+1}^{i-1}) = \frac{\exp (\mathbf {y}_{w_i})}{\sum _{j}\exp (\mathbf {y}_j)}. \end{aligned}$$
(4.9)
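A minimal PyTorch sketch of the FNN language model in Eqs. (4.8) and (4.9) might look as follows; layer sizes and names are illustrative, and the direct word-to-output connections of the original model [3] are omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FNNLanguageModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128, context_size=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)                # word vectors
        self.hidden = nn.Linear(context_size * embed_dim, hidden_dim)   # W, b
        self.output = nn.Linear(hidden_dim, vocab_size)                 # M, d

    def forward(self, context):              # context: (batch, context_size) word ids
        x = self.embed(context).flatten(1)   # concatenate the n-1 context word vectors
        y = self.output(torch.tanh(self.hidden(x)))   # y = M f(Wx + b) + d  (Eq. 4.8)
        return F.log_softmax(y, dim=-1)      # normalized conditional probability (Eq. 4.9)

model = FNNLanguageModel(vocab_size=10000)
context = torch.randint(0, 10000, (2, 4))    # two contexts of n-1 = 4 previous words
log_probs = model(context)                   # (2, 10000)
```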

4.4.2 Convolutional Neural Network Language Model

The Convolutional Neural Network (CNN) is a family of neural network models featuring a type of layer known as the convolutional layer, which extracts features with a learnable filter (or kernel) at different positions of the input. Pham et al. [47] propose a CNN language model to enhance the FNN language model. The network is obtained by injecting a convolutional layer over the input word representations \(\mathbf {x} = [\mathbf {w}_{i-n}, \ldots , \mathbf {w}_{i-1}]\). Formally, the convolutional layer applies a sliding window over the input vectors centered on each word vector with a parameter matrix \(\mathbf {W}_c\), which can be generally represented as

$$\begin{aligned} \mathbf {y}=\mathbf {M} \big ({\text {max}}(\mathbf {W_c}\mathbf {x})\big ), \end{aligned}$$
(4.10)

where \({\text {max}}(\cdot )\) indicates a max-pooling layer. The architecture of CNN is shown in Fig. 4.1.

Fig. 4.1 The architecture of CNN

Moreover, [12] also introduces a convolutional neural network for language modeling with a novel gating mechanism.

4.4.3 Recurrent Neural Network Language Model

To address the FNN language model's inability to model long-term dependencies, [41] proposes a Recurrent Neural Network (RNN) language model, which applies an RNN to language modeling. RNNs are fundamentally different from FNNs in that they operate not only on an input space but also on an internal state space, and the internal state space enables the representation of sequentially extended dependencies. Therefore, the RNN language model can deal with sentences of arbitrary length. At every time step, its input is the vector of a single word rather than the concatenation of the vectors of the n previous words, and the information of all earlier words is carried by the internal state. Formally, the RNN language model can be defined as

$$\begin{aligned} \mathbf {h}_i= & {} f(\mathbf {W}_1 \mathbf {h}_{i-1} + \mathbf {W}_2 \mathbf {w}_{i}+\mathbf {b}),\end{aligned}$$
(4.11)
$$\begin{aligned} \mathbf {y}= & {} \mathbf {M}\mathbf {h}_{i-1} + \mathbf {d}, \end{aligned}$$
(4.12)

where \(\mathbf {W}_1, \mathbf {W}_2, \mathbf {M}\) are weight matrices and \(\mathbf {b}, \mathbf {d}\) are bias vectors. The RNN unit can also be implemented with LSTM or GRU cells. The architecture of RNN is shown in Fig. 4.2.

Fig. 4.2 The architecture of RNN

Recently, researchers have compared neural network language models with different architectures on both small and large corpora. The experimental results show that, generally, the RNN language model outperforms the CNN language model.
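A minimal PyTorch sketch of an RNN language model in the spirit of Eqs. (4.11) and (4.12) is given below; we use a GRU cell as the recurrent unit, and all sizes and names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RNNLanguageModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)   # recurrence of Eq. (4.11)
        self.output = nn.Linear(hidden_dim, vocab_size)              # M, d of Eq. (4.12)

    def forward(self, tokens):                # tokens: (batch, seq_len) word ids
        h, _ = self.rnn(self.embed(tokens))   # hidden state at every time step
        return F.log_softmax(self.output(h), dim=-1)   # next-word distribution at each step

model = RNNLanguageModel(vocab_size=10000)
tokens = torch.randint(0, 10000, (2, 7))
log_probs = model(tokens)    # (2, 7, 10000); the output at position i predicts word i+1
```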

4.4.4 Transformer Language Model

In 2018, Google proposed a pre-trained language model (PLM) called BERT, which achieved state-of-the-art results on a variety of NLP tasks. It drew enormous attention, and since then NLP researchers have been exploring how PLMs can benefit their own research tasks.

In this section, we will first introduce the Transformer architecture and then talk about BERT and other PLMs in detail.

4.4.4.1 Transformer

Transformer [65] is a nonrecurrent encoder-decoder architecture built from a series of attention-based blocks. The encoder has six layers, and each layer is composed of a multi-head attention sublayer and a position-wise feedforward sublayer, with a residual connection around each sublayer. The architecture of the Transformer is shown in Fig. 4.3.

Fig. 4.3 The architecture of Transformer

There are several attention heads in the multi-head attention sublayer. Each head is a scaled dot-product attention structure, which takes the query matrix \(\mathbf {Q}\), the key matrix \(\mathbf {K}\), and the value matrix \(\mathbf {V}\) as inputs, and its output is computed by

$$\begin{aligned} {\text {Attention}}(\mathbf {Q}, \mathbf {K}, \mathbf {V}) = {\text {Softmax}}\left( \frac{\mathbf {QK}^T}{\sqrt{d_k}}\right) \mathbf {V}, \end{aligned}$$
(4.13)

where \(d_k\) is the dimension of the query and key vectors.

The multi-head attention sublayer linearly projects the input hidden states \(\mathbf {H}\) several times into the query, key, and value matrices of the h heads. The dimensions of the query, key, and value vectors are \(d_k\), \(d_k\), and \(d_v\), respectively. The multi-head attention sublayer can be formulated as

$$\begin{aligned} {\text {Multihead}}(H) = [head_1, head_2,\ldots ,head_h]\mathbf {W}^O, \end{aligned}$$
(4.14)

where \({head_i} = {\text {Attention}}(\mathbf {HW}_i^Q, \mathbf {HW}^K_i, \mathbf {HW}^V_i)\), and \(\mathbf {W}^Q_i\), \(\mathbf {W}^K_i\), and \(\mathbf {W}^V_i\) are linear projections; \(\mathbf {W}^O\) is a linear projection for the output. The fully connected position-wise feedforward sublayer contains two linear transformations with a ReLU activation:

$$\begin{aligned} {\text {FFN}}(x) = \mathbf {W}_2\max (0, \mathbf {W}_1x + \mathbf {b}_1)+\mathbf {b}_2. \end{aligned}$$
(4.15)

The Transformer is better than RNNs at modeling long-term dependencies, since all tokens are treated equally during the attention operation regardless of their distance. The Transformer was originally proposed for machine translation, but because of its powerful ability to model sequential data, it has become the most popular backbone of NLP applications.
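As a small illustration of Eq. (4.13), the following PyTorch sketch computes one attention head; the toy dimensions and randomly initialized weight matrices are arbitrary stand-ins for the learned projections \(\mathbf{W}^Q_i\), \(\mathbf{W}^K_i\), and \(\mathbf{W}^V_i\).

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = Softmax(Q K^T / sqrt(d_k)) V   (Eq. 4.13)
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5
    return F.softmax(scores, dim=-1) @ V

# One head over a toy sequence: H is (seq_len, d_model); W_q, W_k, W_v project H.
seq_len, d_model, d_k = 5, 16, 8
H = torch.randn(seq_len, d_model)
W_q, W_k, W_v = (torch.randn(d_model, d_k) for _ in range(3))
head = scaled_dot_product_attention(H @ W_q, H @ W_k, H @ W_v)   # (seq_len, d_k)
```

In the multi-head sublayer of Eq. (4.14), several such heads are computed in parallel, concatenated, and projected by \(\mathbf{W}^O\).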

4.4.4.2 Transformer-Based PLM

Neural models can learn large amounts of language knowledge from language modeling. Since this knowledge covers the demands of many downstream NLP tasks and provides powerful representations of words and sentences, researchers found that it can be transferred to other NLP tasks easily. The transferred models are called Pre-trained Language Models (PLMs).

Language modeling is the most basic and most important NLP task. It requires a variety of knowledge for language understanding, such as linguistic knowledge and factual knowledge. For example, the model needs to decide whether it should add an article before a noun, which requires linguistic knowledge about articles. Another example is predicting the next word after “Trump is the president of”. The answer is “America”, which requires factual knowledge. Since language modeling is so complex, models can learn a lot from this task.

On the other hand, language modeling only requires plain text without any human annotation. Owing to this feature, models can learn complex NLP abilities from very large-scale corpora. Since deep learning needs large amounts of data and language modeling can make full use of all the text in the world, PLMs significantly benefit the development of NLP research.

Inspired by the success of the Transformer, GPT [50] and BERT [14] adopt the Transformer as the backbone of pre-trained language models, and they are the most representative Transformer-based PLMs. Since they achieved state-of-the-art performance on various NLP tasks, nearly all PLMs after them are based on the Transformer. In this subsection, we discuss GPT and BERT in more detail.

GPT is the first work to pretrain a PLM based on the Transformer. The training procedure of GPT [50] contains two classic stages: generative pretraining and discriminative fine-tuning.

In the pretraining stage, the input of the model is a large-scale unlabeled corpus denoted as \(\mathscr {U} = \{u_1, u_2,\ldots ,u_n\}\). The pretraining stage optimizes a language model: the learning objective over the corpus is to maximize the conditional log-likelihood within a fixed-size context window:

$$\begin{aligned} \mathscr {L}_1(\mathscr {U}) = \sum _i \log P(u_i|u_{i-k}, \ldots , u_{i-1}; \varTheta ), \end{aligned}$$
(4.16)

where k is the size of the context window, and the conditional probability P is modeled by a neural network with parameters \(\varTheta \).

For a supervised dataset \(\chi \), the input is a sequence of words \(s = (w_1, w_2, \ldots , w_l)\) and the output is a label y. The pretraining stage provides a good starting point of parameters for initializing subsequent supervised tasks. In the fine-tuning stage, the objective is a discriminative task that maximizes the conditional probability:

$$\begin{aligned} \mathscr {L}_2(\chi ) = \sum _{(s,y)} \log P(y|w_1, \ldots , w_l), \end{aligned}$$
(4.17)

where \(P(y|w_1, \ldots ,w_l)\) is modeled by a K-layer Transformer. After the input tokens pass through the pretrained GPT, the hidden vector of the final layer \(\mathbf {h}_l^K\) is produced. To obtain the output distribution, a linear transformation layer is added whose output size equals the number of labels:

$$\begin{aligned} P(y|w_1, \ldots , w_l) = {\text {Softmax}}(\mathbf {W}_y\mathbf {h}_l^K). \end{aligned}$$
(4.18)

The final training objective combines the task objective with the language modeling objective \(\mathscr {L}_1\) for better generalization:

$$\begin{aligned} \mathscr {L}(\chi ) = \mathscr {L}_2(\chi ) + \lambda * \mathscr {L}_1(\chi ), \end{aligned}$$
(4.19)

where \(\lambda \) is a weight hyperparameter.
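To make the combined objective concrete, here is a hedged PyTorch sketch of the loss in Eq. (4.19), written as losses to minimize rather than likelihoods to maximize; the tensor shapes and the function name gpt_finetune_loss are our own illustration, not GPT's released code.

```python
import torch
import torch.nn.functional as F

def gpt_finetune_loss(lm_logits, lm_targets, cls_logits, labels, lam=0.5):
    """Combined fine-tuning objective L = L2 + lambda * L1 (Eq. 4.19).

    lm_logits:  (batch, seq_len, vocab) next-token predictions from the Transformer (L1)
    lm_targets: (batch, seq_len)        gold next tokens for language modeling
    cls_logits: (batch, num_labels)     task predictions from the added linear layer (L2)
    labels:     (batch,)                gold task labels
    """
    lm_loss = F.cross_entropy(lm_logits.flatten(0, 1), lm_targets.flatten())
    cls_loss = F.cross_entropy(cls_logits, labels)
    return cls_loss + lam * lm_loss
```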

Fig. 4.4 The pretraining and fine-tuning stages for BERT

BERT [14] is a milestone work in the field of PLMs. BERT achieved significant empirical results on 17 different NLP tasks, including SQuAD (outperforming human performance), GLUE (7.7% absolute improvement), MultiNLI (4.6% absolute improvement), etc. Compared to GPT, BERT uses a bidirectional deep Transformer as the model backbone. As illustrated in Fig. 4.4, BERT contains pretraining and fine-tuning stages.

In the pretraining stage, two objectives are designed: Masked Language Model (MLM) and Next Sentence Prediction (NSP). (1) For MLM, some tokens are randomly masked with a special token \(\mathrm{[MASK]}\), and the training objective is to predict the masked tokens based on their contexts. Compared with the standard unidirectional conditional language model, which can only be trained in one direction, MLM enables training a deep bidirectional representation model. This task is inspired by the Cloze task [64]. (2) The objective of NSP is to capture relationships between sentences for sentence-pair downstream tasks such as natural language inference (NLI) and question answering (QA). In this task, a binary classifier is trained to predict whether one sentence is the actual next sentence of the other. NSP effectively captures the deep relationship between sentences, exploring semantic information at a different level.
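The following sketch, a rough illustration rather than BERT's actual preprocessing code, shows how MLM training pairs can be constructed; the 15% masking rate and the 80/10/10 replacement rule follow the recipe reported in [14], while all names and the toy vocabulary are ours.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """Build a masked-LM training pair: corrupted inputs and per-position targets.

    Positions with target None do not contribute to the MLM loss.
    """
    inputs, targets = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            targets[i] = tok                       # predict the original token here
            r = random.random()
            if r < 0.8:
                inputs[i] = MASK                   # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = random.choice(vocab)   # 10%: replace with a random word
            # remaining 10%: keep the original token unchanged
    return inputs, targets

vocab = ["the", "cat", "sat", "on", "mat", "dog"]
print(mask_tokens(["the", "cat", "sat", "on", "the", "mat"], vocab))
```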

After pretraining, BERT captures various kinds of language knowledge that can be used for downstream supervised tasks. By modifying the inputs and outputs, BERT can be fine-tuned for any NLP task that takes a single text or a text pair as input. The input consists of sentence \(\mathrm{A}\) and sentence \(\mathrm{B}\), which can represent (1) sentence pairs in paraphrase, (2) hypothesis-premise pairs in entailment, (3) question-passage pairs in QA, and (4) text-\(\emptyset \) pairs for text classification or sequence tagging. For the output, BERT produces a token-level representation for each token, which can be used for sequence tagging or question answering, while the representation of the special token \([{\text {CLS}}]\) is fed into a classification layer for sequence classification.

4.4.4.3 PLM Family

Pre-trained language models have made rapid progress since BERT. We summarize several important directions of PLMs and show some representative models and their relationships in Fig. 4.5.

Here is a brief introduction to the PLMs after BERT. First, there are variants of BERT that pursue better general language representations, such as RoBERTa [38] and XLNet [70]; these models mainly focus on improving the pretraining tasks. Second, some works build pretrained generation models, such as MASS [57] and UniLM [15], which achieve promising results on generation tasks rather than the Natural Language Understanding (NLU) tasks targeted by BERT. Third, the sentence-pair format of BERT inspired work in the cross-lingual and cross-modal fields; XLM [8], ViLBERT [39], and VideoBERT [59] are important works in this direction. Lastly, some works [46, 81] explore incorporating external knowledge into PLMs, since low-frequency knowledge cannot be efficiently learned by PLMs from plain text alone.

Fig. 4.5 The pre-trained language model family

4.4.5 Extensions

4.4.5.1 Importance Sampling

Inspired by the contrastive divergence model, [4] proposes to adopt importance sampling to accelerate the training of neural language models. The outputs of the neural network language model are first normalized, viewing the model as a special case of energy-based probability models as follows:

$$\begin{aligned} P(w_i|\mathbf {w}_{i-n}^{i-1}) = \frac{\exp (-y_{w_i})}{\sum _{j}\exp (-y_j)}. \end{aligned}$$
(4.20)

The key idea of importance sampling is to approximate the log-likelihood gradient of the neural network language model by sampling several important words instead of computing the explicit gradient over the whole vocabulary. The log-likelihood gradient can be generally represented as

$$\begin{aligned} \frac{\partial \log P(w_i|\mathbf {w}_{i-n}^{i-1})}{\partial \theta }= & {} -\frac{\partial y_{w_i}}{\partial \theta } + \sum _{j=1}^{|V|} P(w_j|\mathbf {w}_{i-n}^{i-1}) \frac{\partial y_j}{\partial \theta } \nonumber \\= & {} -\frac{\partial y_{w_i}}{\partial \theta }+ \mathbb {E}_{w_k\sim P}\left[ \frac{\partial y_{k}}{\partial \theta }\right] , \end{aligned}$$
(4.21)

where \(\theta \) indicates all parameters of the neural network language model. The gradient consists of two parts: a positive gradient for the target word \(w_i\) and a negative gradient over all words, i.e., \(\mathbb {E}_{w_k\sim P}[\frac{\partial y_k}{\partial \theta }]\). The second part can be approximated by sampling important words following the probability distribution P:

$$\begin{aligned} \mathbb {E}_{w_k\sim P}\left[ \frac{\partial y_k}{\partial \theta }\right] \approx \sum _{w_k \in V'} \frac{1}{|V'|} \frac{\partial y_{k}}{\partial \theta }, \end{aligned}$$
(4.22)

where \(V'\) is the word set sampled under P.

However, since the probability distribution P cannot be computed efficiently in advance, it is impossible to sample words directly from P. Therefore, importance sampling adopts a Monte Carlo scheme that uses an existing proposal distribution Q to approximate P, and then we have

$$\begin{aligned} \mathbb {E}_{w_k\sim P}\left[ \frac{\partial y_k}{\partial \theta }\right] \approx \frac{1}{|V''|}\sum _{w_l\in V''} \frac{\partial y_{l}}{\partial \theta }P(w_l|\mathbf {w}_{i-n}^{i-1})/Q(w_l), \end{aligned}$$
(4.23)

where \(V''\) is the word set sampled under Q. Moreover, the sample size of the importance sampling approach should be increased as training progresses in order to avoid divergence, which is monitored through the effective sample size S:

$$\begin{aligned} S = \frac{(\sum _{w_l\in V''}r_l)^2}{\sum _{w_l\in V''}r_l^2}, \end{aligned}$$
(4.24)

where \(r_l\) is further defined as

$$\begin{aligned} r_l = \frac{P(w_l|\mathbf {w}_{i-n}^{i-1})/Q(w_l)}{\sum _{w_j\in V''} P(w_j|\mathbf {w}_{i-n}^{i-1})/Q(w_j)}. \end{aligned}$$
(4.25)
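The following NumPy sketch illustrates the importance-weighted estimator of Eq. (4.23) on synthetic quantities; in a real model the per-word gradients and the model scores would come from the network, so the explicit distribution p used here serves only to compare the estimate against the exact expectation. All names are ours.

```python
import numpy as np

def is_gradient_estimate(grads, p, q, sample_ids):
    """Approximate E_{w~P}[dy_w/dtheta] by importance sampling (Eq. 4.23).

    grads:      (|V|, dim) per-word gradients dy_w / dtheta
    p:          (|V|,) target distribution P(w | context)
    q:          (|V|,) proposal distribution Q(w), e.g., unigram frequencies
    sample_ids: indices of the words sampled from Q
    """
    weights = p[sample_ids] / q[sample_ids]                  # importance weights P/Q
    return (weights[:, None] * grads[sample_ids]).mean(axis=0)

rng = np.random.default_rng(0)
V, dim, n_samples = 1000, 8, 32
grads = rng.normal(size=(V, dim))
p = rng.dirichlet(np.ones(V))        # stand-in for the model distribution P
q = np.full(V, 1.0 / V)              # uniform proposal Q for illustration
sample_ids = rng.choice(V, size=n_samples, p=q)
estimate = is_gradient_estimate(grads, p, q, sample_ids)
exact = (p[:, None] * grads).sum(axis=0)   # the expectation being approximated
```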

4.4.5.2 Word Classification

Besides importance sampling, researchers [7, 22] also propose class-based language models, which adopt word classification to improve the performance and speed of a language model. In a class-based language model, each word is assigned to a unique class, and the conditional probability of a word given its context can be decomposed into the probability of the word's class given the previous words and the probability of the word given its class, which is formally defined as

$$\begin{aligned} P(w_i | \mathbf {w}_{i-n}^{i-1}) =\sum _{c(w_i)\in C} P(w_i | c(w_i)) P (c(w_i) | \mathbf {w}_{i-n}^{i-1}), \end{aligned}$$
(4.26)

where C indicates the set of all classes and \(c(w_i)\) indicates the class of word \(w_i\).

Moreover, [44] proposes a hierarchical neural network language model, which extends word classification to a hierarchical binary clustering of words. Instead of assigning each word a single class, it first builds a hierarchical binary tree of words according to word similarities obtained from WordNet. It then assigns a unique bit vector \(c(w_i) = [c_1(w_i), c_2(w_i), \ldots , c_l(w_i)]\) to each word, which encodes the word's path in the hierarchy. The conditional probability of each word can then be defined as

$$\begin{aligned} P(w_i | \mathbf {w}_{i-n}^{i-1}) = \prod _{j=0}^l P(c_j(w_i) | c_1(w_i), c_2(w_i), \ldots , c_{j-1}(w_i), \mathbf {w}_{i-n}^{i-1}). \end{aligned}$$
(4.27)

The hierarchical neural network language model can achieve a speedup of \(O(k/\log k)\) compared to a standard language model. However, the experimental results of [44] show that although the hierarchical model speeds up sentence modeling impressively, it performs worse than the standard language model. The reason is perhaps that imposing a hierarchical architecture or word classes negatively influences the word prediction of neural network language models.

4.4.5.3 Caching

Caching is another important extension of language models. One type of cache-based language model assumes that each word in the recent context is more likely to appear again [58]. Hence, the conditional probability of a word can be calculated by combining information from the history and the cache:

$$\begin{aligned} P(w_i | \mathbf {w}_{i-n}^{i-1}) = \lambda P_s(w_i | \mathbf {w}_{i-n}^{i-1}) + (1-\lambda ) P_c(w_i | \mathbf {w}_{i-n}^{i-1}), \end{aligned}$$
(4.28)

where \(P_s(w_i | \mathbf {w}_{i-n}^{i-1})\) indicates the conditional probability generated by the standard language model, \(P_c(w_i | \mathbf {w}_{i-n}^{i-1})\) indicates the conditional probability generated from the cache, and \(\lambda \) is an interpolation coefficient.

Caching has also been used to speed up RNN language modeling [27]. The main idea of this approach is to store the outputs and states of the language model for future predictions given the same contextual history.
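A toy sketch of the interpolation in Eq. (4.28), using a simple unigram cache over the recent history, might look as follows; the static model and all names are placeholders of our own.

```python
from collections import Counter

def cached_prob(word, history, static_prob, lam=0.9):
    """P(w | h) = lam * P_s(w | h) + (1 - lam) * P_c(w | h)   (Eq. 4.28)

    static_prob: a function giving the standard language model probability.
    The cache component P_c is a unigram distribution over the recent history.
    """
    cache = Counter(history)
    p_cache = cache[word] / len(history) if history else 0.0
    return lam * static_prob(word, history) + (1 - lam) * p_cache

# Toy usage with a uniform "static" model over a 10,000-word vocabulary.
uniform = lambda w, h: 1.0 / 10000
history = ["the", "cat", "sat", "on", "the", "mat"]
print(cached_prob("cat", history, uniform))   # boosted because "cat" occurred recently
```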

4.5 Applications

In this section, we will introduce two typical sentence-level NLP applications including text classification and relation extraction, as well as how to utilize sentence representation for these applications.

4.5.1 Text Classification

Text classification is a typical NLP application and is involved in many important real-world tasks such as parsing and semantic analysis. Therefore, it has attracted the interest of many researchers. Conventional text classification models (e.g., LDA [6] and tree kernel [48] models) focus on capturing more contextual information and correct word order by extracting more useful and distinct features, but they still expose issues (e.g., data sparsity) that have a significant impact on classification accuracy. Recently, with the development of deep learning in various fields of artificial intelligence, neural models have been introduced into text classification due to their ability to learn text representations. In this section, we introduce two typical text classification tasks: sentence classification and sentiment classification.

4.5.1.1 Sentence Classification

Sentence classification aims to assign a sentence an appropriate category, which is a basic task of the text classification application.

Considering the effectiveness of CNN models in capturing sentence semantics, [31] first proposes to utilize CNNs trained on top of pretrained word embeddings to classify sentences, achieving promising results on several sentence classification datasets. Then, [30] introduces a dynamic CNN model to capture the semantic meanings of sentences; it handles sentences of varying lengths and uses dynamic max-pooling over linear sequences, which helps the model capture both short-range and long-range semantic relations in sentences. Furthermore, [9] proposes a novel CNN-based model named Very Deep CNN, which operates directly at the character level, showing that deeper models achieve better results on sentence classification and can capture hierarchical information from scattered characters to whole sentences. Yin and Schütze [74] also propose MV-CNN, which utilizes multiple types of pretrained word embeddings and extracts features from multi-granular phrases with variable-sized convolutional layers. To address the drawbacks of MV-CNN, such as model complexity and the requirement that all embeddings have the same dimension, [80] proposes MG-CNN to capture multiple features from multiple sets of embeddings, which are concatenated at the penultimate layer. Zhang et al. [79] present RA-CNN to jointly exploit labels on documents and their constituent sentences; the model estimates the probability that a given sentence is informative and then scales the contribution of each sentence to the aggregated document representation in proportion to these estimates.
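As a rough illustration in the spirit of the CNN sentence classifier of [31] (not the authors' implementation), the following PyTorch sketch embeds a sentence, applies convolutions with several filter sizes, max-pools over time, and classifies; all hyperparameters and names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CNNSentenceClassifier(nn.Module):
    def __init__(self, vocab_size, num_classes, embed_dim=128,
                 num_filters=100, kernel_sizes=(3, 4, 5)):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, num_filters, k) for k in kernel_sizes])
        self.fc = nn.Linear(num_filters * len(kernel_sizes), num_classes)

    def forward(self, tokens):                    # tokens: (batch, seq_len)
        x = self.embed(tokens).transpose(1, 2)    # (batch, embed_dim, seq_len)
        # Convolve with each filter size, then max-pool over time.
        pooled = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return self.fc(torch.cat(pooled, dim=1))  # class logits

model = CNNSentenceClassifier(vocab_size=20000, num_classes=2)
logits = model(torch.randint(0, 20000, (8, 40)))  # (8, 2)
```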

RNN models, which capture the sequential information of sentences, are also widely used in sentence classification. Lai et al. [32] propose a neural network for text classification that applies a recurrent structure to capture contextual information. Moreover, [37] introduces an RNN-based multitask learning framework to jointly learn across multiple sentence classification tasks, employing three different mechanisms of information sharing to model sentences with both task-specific and shared layers. Yang et al. [71] introduce word-level and sentence-level attention mechanisms into an RNN-based model, as well as a hierarchical structure, to capture the hierarchical information of documents for sentence classification.

4.5.1.2 Sentiment Classification

Sentiment classification is a special case of sentence classification whose objective is to classify the sentiment polarity of the opinions a piece of text contains, e.g., favorable or unfavorable, positive or negative. This task appeals to the NLP community since it has many potential downstream applications such as movie review suggestion.

Similar to text classification, neural sentence representations have also been widely explored for sentiment classification. Glorot et al. [20] first use a stacked denoising autoencoder for sentiment classification. Then, a series of recursive neural network models based on the recursive tree structure of sentences were proposed to learn sentence representations for sentiment classification, including the recursive autoencoder (RAE) [55], the matrix-vector recursive neural network (MV-RNN) [54], and the recursive neural tensor network (RNTN) [56]. Besides, [29] adopts a CNN to learn sentence representations and achieves promising performance in sentiment classification.

RNN models also benefit sentiment classification thanks to their ability to capture sequential information. Li et al. [35] and Tai et al. [62] investigate tree-structured LSTM models for text classification. Hierarchical models have also been proposed for document-level sentiment classification [5, 63], generating semantic representations at different levels (e.g., phrase, sentence, or document) within a document. Moreover, the attention mechanism has been introduced into sentiment classification to select important words from a sentence or important sentences from a document [71].

4.5.2 Relation Extraction

To enrich existing knowledge graphs (KGs), researchers have devoted many efforts to automatically finding novel relational facts in text. Therefore, relation extraction (RE), which aims at extracting relational facts according to the semantic information in plain text, has become a crucial NLP application. As RE is also an important downstream application of sentence representation, we introduce the techniques and extensions that show how sentence representations are utilized in different RE scenarios. Considering that neural networks have become the backbone of recent NLP research, we mainly focus on Neural RE (NRE) models in this section.

4.5.2.1 Sentence-Level NRE

Fig. 4.6 An example of sentence-level relation extraction

Sentence-level NRE aims at predicting the semantic relation between a given entity (or nominal) pair within a sentence. As shown in Fig. 4.6, given an input sentence s consisting of m words \( s = \{ w_1, w_2, \ldots , w_m \} \) and its corresponding entity pair \(e_1\) and \(e_2\), sentence-level NRE aims to obtain the conditional probability \(P(r|s, e_1, e_2)\) of relation r (\(r \in \mathscr {R}\)) via a neural network, which can be formalized as

$$\begin{aligned} P(r|s, e_1, e_2) = P(r|s, e_1, e_2, \theta ), \end{aligned}$$
(4.29)

where \(\theta \) is all parameters of the neural network and r is a relation in the relation set \(\mathscr {R}\).

A basic form of sentence-level NRE consists of three components: (a) an input encoder to give a representation for each input word, (b) a sentence encoder which computes either a single vector or a sequence of vectors to represent the original sentence, and (c) a relation classifier which calculates the conditional probability distribution of all relations.

Input Encoder. First, a sentence-level NRE system projects the discrete words of the source sentence into a continuous vector space, and obtains the input representation \(\mathbf {w} = \{\mathbf {w}_1, \mathbf {w}_2, \ldots , \mathbf {w}_m\}\) of the source sentence.

(1) Word Embeddings. Word embeddings aim to transform words into distributed representations that capture the syntactic and semantic meanings of words. In the sentence s, every word \(w_i\) is represented by a real-valued vector. Word representations are encoded by column vectors in an embedding matrix \(\mathbf {E} \in \mathbb {R}^{d^a\times |V|} \), where V is a fixed-sized vocabulary. Although word embeddings are the most common way to represent input words, there are also efforts to utilize more complicated information of input sentences for RE.

(2) Position Embeddings. In RE, the words close to the target entities are usually informative for determining the relation between the entities. Therefore, position embeddings are used to help models keep track of how close each word is to the head or tail entity. They are defined as the combination of the relative distances from the current word to the head and tail entities. For example, in the sentence Bill_Gates is the founder of Microsoft., the relative distance from the word founder to the head entity Bill_Gates is \(-3\) and that to the tail entity Microsoft is 2. Besides word position embeddings, further linguistic features are also considered in addition to the word embeddings to enrich the linguistic information of the input sentence.

(3) Part-of-speech (POS) Tag Embeddings. POS tag embeddings represent the lexical information of each word in the sentence. Because word embeddings are obtained from a large-scale general corpus, the general information they contain may not accord with the meaning in a specific sentence. Hence, it is necessary to align each word with its linguistic information in its specific context, e.g., noun or verb. Formally, each word \(w_i\) is encoded by the corresponding column vector in an embedding matrix \(\mathbf {E}^p \in \mathbb {R}^{d^p\times |V^p|}\), where \(d^p\) is the dimension of the embedding vector and \(V^p\) indicates a fixed-sized POS tag vocabulary.

(4) WordNet Hypernym Embeddings. WordNet hypernym embeddings aim to take advantage of prior hypernym knowledge to help RE models. Given the hypernym information of each word in WordNet (e.g., noun.food and verb.motion), it is easier to build connections between different but conceptually similar words. Formally, each word \(w_i\) is encoded by the corresponding column vector in an embedding matrix \(\mathbf {E}^h \in \mathbb {R}^{d^h\times |V^h|}\), where \(d^h\) is the dimension of the embedding vector and \(V^h\) indicates a fixed-sized hypernym vocabulary.

For each word, NRE models often concatenate some of the above four feature embeddings as its input embedding. The per-word feature embeddings then form the final input sequence \(\mathbf {w} = \{\mathbf {w}_1, \mathbf {w}_2, \ldots , \mathbf {w}_m\} \), where \(\mathbf {w}_i\in \mathbb {R}^d\) and d is the total dimension of the feature embeddings concatenated for each word.
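The sketch below illustrates one plausible input encoder that concatenates word embeddings with two position embeddings (relative distances to the head and tail entities); the dimensions, the distance clipping, and all names are our own assumptions, not a specific published implementation.

```python
import torch
import torch.nn as nn

class REInputEncoder(nn.Module):
    """Concatenate word embeddings with two position embeddings (distances to head/tail)."""
    def __init__(self, vocab_size, word_dim=50, pos_dim=5, max_dist=30):
        super().__init__()
        self.max_dist = max_dist
        self.word_embed = nn.Embedding(vocab_size, word_dim)
        # 2 * max_dist + 1 buckets for relative distances in [-max_dist, max_dist]
        self.pos_embed_head = nn.Embedding(2 * max_dist + 1, pos_dim)
        self.pos_embed_tail = nn.Embedding(2 * max_dist + 1, pos_dim)

    def forward(self, tokens, head_dist, tail_dist):   # all tensors: (batch, seq_len)
        def bucket(dist):
            # Clip relative distances and shift them into valid embedding indices.
            return torch.clamp(dist, -self.max_dist, self.max_dist) + self.max_dist
        feats = [self.word_embed(tokens),
                 self.pos_embed_head(bucket(head_dist)),
                 self.pos_embed_tail(bucket(tail_dist))]
        return torch.cat(feats, dim=-1)          # (batch, seq_len, word_dim + 2 * pos_dim)

enc = REInputEncoder(vocab_size=20000)
tokens = torch.randint(0, 20000, (2, 10))
positions = torch.arange(10).repeat(2, 1)
w = enc(tokens, positions - 1, positions - 6)    # distances to head (index 1) and tail (index 6)
```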

Sentence Encoder. The sentence encoder is the core for sentence representation, which encodes input representations into either a single vector or a sequence of vectors \(\mathbf {x}\) to represent sentences. We will introduce the different sentence encoders in the following.

(1) Convolutional Neural Network Encoder. Zeng et al. [76] propose to encode input sentences using a CNN model, which extracts local features by a convolutional layer and combines all local features via a max-pooling operation to obtain a fixed-sized vector for the input sentence. Formally, a convolutional layer is defined as an operation on a vector sequence \(\mathbf {w}\):

$$\begin{aligned} \mathbf {p} = {\text {CNN}}(\mathbf {w}), \end{aligned}$$
(4.30)

where \({\text {CNN}}\) indicates the convolution operation inside the convolutional layer.

And the ith element of the sentence vector \(\mathbf {x}\) can be calculated as follows:

$$\begin{aligned}{}[\mathbf {x}]_i = f(\max (\mathbf {p}_i)), \end{aligned}$$
(4.31)

where f is a nonlinear function applied at the output, such as the hyperbolic tangent function.

Further, PCNN [75], a variant of CNN, adopts a piecewise max-pooling operation. All hidden vectors \(\{\mathbf {p}_1, \mathbf {p}_2, \ldots \}\) are divided into three segments by the head and tail entities. The max-pooling operation is performed over the three segments separately, and \(\mathbf {x}\) is the concatenation of the pooling results over the three segments.
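A hedged sketch of the piecewise max-pooling step described for PCNN [75] is given below; the segment-splitting convention and all names reflect our own reading of the description rather than the original code.

```python
import torch

def piecewise_max_pool(hidden, head_idx, tail_idx):
    """Piecewise max pooling over convolutional hidden vectors.

    hidden: (seq_len, dim) hidden vectors from the convolutional layer.
    The sequence is split into three segments by the head and tail positions,
    each segment is max-pooled separately, and the results are concatenated.
    """
    left, right = sorted((head_idx, tail_idx))
    segments = [hidden[: left + 1], hidden[left + 1: right + 1], hidden[right + 1:]]
    pooled = [seg.max(dim=0).values if len(seg) > 0 else hidden.new_zeros(hidden.size(1))
              for seg in segments]
    return torch.cat(pooled)        # (3 * dim,)

x = piecewise_max_pool(torch.randn(12, 230), head_idx=2, tail_idx=8)  # (690,)
```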

(2) Recurrent Neural Network Encoder. Zhang and Wang [78] propose to embed input sentences using an RNN model, which can learn temporal features. Formally, each input word representation is fed into the recurrent layer step by step. At each step i, the network takes the ith word representation \(\mathbf {w}_{i}\) and the output of the previous step \(\mathbf {h}_{i-1}\), which summarizes the first \(i-1\) words, as input:

$$\begin{aligned} \mathbf {h}_i = {\text {RNN}}(\mathbf {w}_{i}, \mathbf {h}_{i-1}), \end{aligned}$$
(4.32)

where \({\text {RNN}}\) indicates the transform function inside the RNN cell, which can be the LSTM units or the GRU units mentioned before.

Conventional RNN models typically process text sequences from start to end and build the hidden state of each word considering only its preceding words. It has been verified that a hidden state that also considers the following words is more effective. Hence, the bidirectional RNN (BRNN) [52] is adopted to learn hidden states using both preceding and following words.

Similar to the previous CNN models in RE, the RNN model combines the output vectors of the recurrent layer as local features, and then uses a max-pooling operation to extract the global feature, which forms the representation of the whole input sentence. The max-pooling layer could be formulated as

$$\begin{aligned}{}[\mathbf {x}]_{j} = \max \limits _{i}{{[\mathbf {h}_{i}]_{j}}}. \end{aligned}$$
(4.33)

Besides max-pooling, word attention can also be used to combine all local feature vectors. The attention mechanism [1] learns attention weights over the steps. Supposing \(\mathbf {H} = [\mathbf {h}_{1}, \mathbf {h}_{2}, \ldots , \mathbf {h}_{m}]\) is the matrix of all output vectors produced by the recurrent layer, the feature vector of the whole sentence \(\mathbf {x}\) is formed by a weighted sum of these output vectors:

$$\begin{aligned} \alpha= & {} \text {Softmax}(\mathbf {s}^{\top }\tanh (\mathbf {H})),\end{aligned}$$
(4.34)
$$\begin{aligned} \mathbf {x}= & {} \mathbf {H}\alpha ^{\top }, \end{aligned}$$
(4.35)

where \(\mathbf {s}\) is a trainable query vector.
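A small PyTorch sketch of the word attention in Eqs. (4.34) and (4.35) might look as follows; the shapes follow the notation above, and the random tensors stand in for real RNN outputs and a learned query vector.

```python
import torch
import torch.nn.functional as F

def word_attention(H, s):
    """Weighted sum of RNN outputs (Eqs. 4.34-4.35).

    H: (dim, seq_len) matrix of output vectors [h_1, ..., h_m]
    s: (dim,) trainable query vector
    """
    alpha = F.softmax(s @ torch.tanh(H), dim=-1)   # (seq_len,) attention weights
    return H @ alpha                               # (dim,) sentence representation x

H = torch.randn(100, 20)                 # 20 hidden states of dimension 100
s = torch.randn(100, requires_grad=True)
x = word_attention(H, s)
```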

Besides, [42] proposes a model that captures information from both the word sequence and the tree-structured dependency by stacking bidirectional path-based LSTM-RNNs (i.e., bottom-up and top-down). More specifically, it focuses on the shortest path between the two target entities in the dependency tree and utilizes the stacked layers to encode this shortest path into the whole sentence representation. In fact, preliminary work [69] has shown that these paths are useful in RE, and various recursive neural models have also been proposed for this purpose. Next, we introduce these recursive models in detail.

(3) Recursive Neural Network Encoder. The recursive encoder aims to extract features from syntactic parsing trees, considering that syntactic information is beneficial for extracting relations from sentences. Generally, these encoders treat the tree structure of the parse as both a strategy of composition and a direction for combining word features.

Socher et al. [54] propose a recursive matrix-vector model (MV-RNN) that captures structural information by assigning a matrix-vector representation to each constituent in the parse tree. The vector captures the meaning of the constituent itself, and the matrix represents how it modifies the meaning of the word it combines with. Tai et al. [62] further propose two types of tree-structured LSTMs, the Child-Sum Tree-LSTM and the N-ary Tree-LSTM, to capture tree-structure information. For the Child-Sum Tree-LSTM, given a tree, let C(t) denote the set of children of node t. Its transition equation is defined as follows:

$$\begin{aligned} \hat{\mathbf {h}}_t= & {} \sum _{k\in C(t)} {\text {TLSTM}}(\mathbf {h}_k), \end{aligned}$$
(4.36)

where \({\text {TLSTM}}(\cdot )\) indicates a Tree-LSTM cell, a simple modification of the LSTM cell. The N-ary Tree-LSTM has similar transition equations to the Child-Sum Tree-LSTM; the only difference is that it limits the tree structures to have at most N branches.

Relation Classifier. After obtaining the representation \(\mathbf {x}\) of the input sentence, the relation classifier calculates the conditional probability \(P(r|x, e_1, e_2)\) via a softmax layer as follows:

$$\begin{aligned} P(r|x, e_1, e_2) = \text {Softmax}(\mathbf {Mx}+\mathbf {b}), \end{aligned}$$
(4.37)

where \(\mathbf {M}\) indicates the relation matrix and \(\mathbf {b}\) is a bias vector.

4.5.2.2 Bag-Level NRE

Although existing neural models have achieved great success in extracting novel relational facts, they always suffer from a lack of training data. To address this issue, researchers proposed the distant supervision assumption to generate training data by automatically aligning KGs and plain text. The intuition of the distant supervision assumption is that every sentence containing two entities will express the relation between them in the KG. For example, given the relational fact (New York, city_of, United States) in a KG, distant supervision regards all sentences containing these two entities as positive instances for the relation city_of. It offers a natural way of utilizing information from multiple sentences (bag level) rather than a single sentence (sentence level) to decide whether a relation holds between two entities.

Fig. 4.7 An example of bag-level relation extraction

Therefore, bag-level NRE aims to predict the semantic relation between an entity pair using all the involved sentences. As shown in Fig. 4.7, given an input sentence set S consisting of n sentences \(S = \{s_1, s_2, \ldots , s_n\}\) and its corresponding entity pair \(e_1\) and \(e_2\), bag-level NRE aims to obtain the conditional probability \(P(r|S, e_1, e_2)\) of relation r (\(r\in \mathscr {R}\)) via a neural network, which can be formalized as

$$\begin{aligned} P(r|S, e_1, e_2) = P(r|S, e_1, e_2, \theta ). \end{aligned}$$
(4.38)

A basic form of bag-level NRE consists of four components: (a) an input encoder similar to that of sentence-level NRE, (b) a sentence encoder similar to that of sentence-level NRE, (c) a bag encoder, which computes a vector representing all related sentences in a bag, and (d) a relation classifier similar to that of sentence-level NRE, which takes the bag vector as input instead of the sentence vector. Since the input encoder, sentence encoder, and relation classifier of bag-level NRE are similar to those of sentence-level NRE, we mainly focus on the bag encoder in detail.

Bag Encoder. The bag encoder encodes all sentence vectors into a single vector \(\mathbf {S}\). We will introduce the different bag encoders in the following:

(1) Random Encoder. It simply assumes that each sentence can express the relation between the two target entities and randomly selects one sentence to represent the bag. Formally, the bag representation is defined as

$$\begin{aligned} \mathbf {S} = \mathbf {s}_{i}\ (i \in \{1, 2, \ldots , n\}), \end{aligned}$$
(4.39)

where \(\mathbf {s}_i\) indicates the sentence representation of \(s_i \in S\) and i is a random index.

(2) Max Encoder. As introduced above, not all sentences containing the two target entities express their relation. For example, the sentence New York City is the premier gateway for legal immigration to the United States does not express the relation city_of. Hence, [75] follows the at-least-one assumption, which assumes that at least one of the sentences containing the two target entities expresses their relation, and selects the sentence with the highest probability for the relation to represent the bag. Formally, the bag representation is defined as

$$\begin{aligned} \mathbf {S} = \mathbf {s}_{i}\ (i = \arg \max _i P(r|s_i, e_1, e_2) ). \end{aligned}$$
(4.40)

(3) Average Encoder. Both the random encoder and the max encoder use only one sentence to represent the bag, ignoring the rich information in the other sentences. To exploit the information of all sentences, [36] assumes that the representation \(\mathbf {S}\) of the bag depends on the representations of all sentences, each of which provides relation information about the two entities to a certain extent. The average encoder assumes that all sentences contribute equally to the bag representation, i.e., the embedding \(\mathbf {S}\) of the bag is the average of all the sentence vectors:

$$\begin{aligned} \mathbf {S} = \sum _i \frac{1}{n} \mathbf {s}_i. \end{aligned}$$
(4.41)

(4) Attentive Encoder. Since the distant supervision assumption inevitably brings in wrongly labeled sentences, the performance of the average encoder is hurt by those sentences that contain no relation information. To address this issue, [36] further proposes a selective attention mechanism to de-emphasize those noisy sentences. Formally, the bag representation is defined as a weighted sum of the sentence vectors:

$$\begin{aligned} \mathbf {S} = \sum _i \alpha _i \mathbf {s}_i, \end{aligned}$$
(4.42)

where \(\alpha _i\) is defined as

$$\begin{aligned} \alpha _i = \frac{\exp ( \mathbf {s}^\top _i \mathbf {A} \mathbf {r})}{\sum _j\exp ( \mathbf {s}^\top _j \mathbf {A} \mathbf {r})}, \end{aligned}$$
(4.43)

where \(\mathbf {A}\) is a diagonal matrix and \(\mathbf {r}\) is the representation vector of relation r.
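A small PyTorch sketch of the selective attention in Eqs. (4.42) and (4.43) is given below; the diagonal matrix \(\mathbf{A}\) is represented by its diagonal, and all tensors are random stand-ins for learned representations.

```python
import torch
import torch.nn.functional as F

def selective_attention(S, r, A_diag):
    """Bag representation as an attention-weighted sum of sentence vectors (Eqs. 4.42-4.43).

    S:      (n, dim) sentence representations s_1, ..., s_n in the bag
    r:      (dim,)   query embedding of the candidate relation
    A_diag: (dim,)   diagonal of the weighting matrix A
    """
    scores = S @ (A_diag * r)            # s_i^T A r for every sentence in the bag
    alpha = F.softmax(scores, dim=0)     # attention weights over the bag
    return alpha @ S                     # (dim,) bag vector

S = torch.randn(5, 230)                  # a bag of five sentence vectors
r = torch.randn(230)
A_diag = torch.ones(230)
bag_vec = selective_attention(S, r, A_diag)
```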

Relation Classifier. Similar to sentence-level NRE, after obtaining the bag representation \(\mathbf {S}\), the relation classifier calculates the conditional probability \(P(r|S, e_1, e_2)\) via a softmax layer as follows:

$$\begin{aligned} P(r|S, e_1, e_2) = {\text {Softmax}}(\mathbf {MS}+\mathbf {b}), \end{aligned}$$
(4.44)

where \(\mathbf {M}\) indicates the relation matrix and \(\mathbf {b}\) is a bias vector.

4.5.2.3 Extensions

Recently, NRE systems have achieved significant improvements in both the supervised and distantly supervised scenarios. However, there are still many challenges in RE, and many researchers are working on other aspects to further improve the performance of NRE. In this section, we introduce these extensions in detail.

Utilization of External Information. Most existing NRE systems stated above concentrate only on the sentences to be extracted from, regardless of rich external information such as KGs. Such heterogeneous information could provide additional knowledge from KGs and is essential when extracting new relational facts.

Han et al. [24] propose a novel joint representation learning framework for knowledge acquisition. The key idea is that the joint model learns knowledge and text representations within a unified semantic space via KG-text alignments. For the text part, a sentence containing the two entities Mark Twain and Florida is fed into a CNN encoder, and the output of the CNN is regarded as the latent relation PlaceOfBirth expressed by this sentence. For the KG part, entity and relation representations are learned via translation-based methods. The learned representations of the KG and text parts are aligned during training. Besides this preliminary attempt, many efforts have been devoted to this direction [25, 28, 51, 67, 68].

Incorporating Relational Paths. Although existing NRE systems have achieved promising results, they still suffer from a major problem: the models can only directly learn from sentences that contain both target entities. However, sentences containing only one of the entities could also provide useful information and help build inference chains. For example, if we know that “A is the son of B” and “B is the son of C”, we can infer that A is the grandson of C.

To utilize the information of both direct and indirect sentences, [77] introduces a path-based NRE model that incorporates textual relational paths. The model first employs a CNN encoder to embed the semantic meanings of sentences. Then, it builds a relation path encoder, which measures the probability of a relation given an inference chain in the text. Finally, the model combines information from both direct sentences and relational paths and predicts the confidence of each relation. This work is the first effort to consider relation paths in text for NRE, and several later methods also consider reasoning paths over sentence semantics for RE [11, 19].

Fig. 4.8 An example of document-level relation extraction

Fig. 4.9 An example from DocRED [72]

Document-level Relation Extraction. In fact, not all relational facts can be extracted by sentence-level RE; a large number of relational facts are expressed across multiple sentences. Taking Fig. 4.9 as an example, multiple entities are mentioned in the document and exhibit complex interactions. In order to identify the relational fact (Riddarhuset, country, Sweden), one has to first identify the fact that Riddarhuset is located in Stockholm from Sentence 4, and then identify the facts that Stockholm is the capital of Sweden and that Sweden is a country from Sentence 1. With these facts, we can finally infer that the sovereign state of Riddarhuset is Sweden. This process requires reading and reasoning over multiple sentences in a document, which is intuitively beyond the reach of sentence-level RE methods. According to statistics on a human-annotated corpus sampled from Wikipedia documents [72], at least \(40.7\%\) of relational facts can only be extracted from multiple sentences, which is not negligible. Swampillai and Stevenson [61] and Verga et al. [66] report similar observations. Therefore, it is necessary to move RE forward from the sentence level to the document level. Figure 4.8 gives an example of document-level RE.

However, existing datasets for document-level RE either have only a small number of manually annotated relations and entities [34], exhibit noisy annotations from distant supervision [45, 49], or serve specific domains or approaches [33]. To address this issue, [72] constructs a large-scale, manually annotated, and general-purpose document-level RE dataset named DocRED. DocRED is constructed from Wikipedia and Wikidata and has two key features. First, it contains 132,375 entities and 56,354 relational facts annotated on 5,053 Wikipedia documents, making it the largest human-annotated document-level RE dataset to date. Second, over \(40\%\) of the relational facts in DocRED can only be extracted from multiple sentences, so DocRED requires reading multiple sentences in a document to recognize entities and inferring their relations by synthesizing all the information of the document.

The experimental results on DocRED show that the performance of existing sentence-level RE methods declines significantly, indicating that document-level RE is more challenging than sentence-level RE and remains an open problem. It also relates to document representation, which will be introduced in the next chapter.

Fig. 4.10 An example of few-shot relation extraction

Few-shot Relation Extraction.

As mentioned before, the performance of conventional RE models [23, 76] depends heavily on time-consuming and labor-intensive annotated data, which makes it hard for them to generalize well. Although adopting distant supervision is a primary approach to alleviate this problem, distantly supervised data also exhibit a long-tail distribution, where most relations have very limited instances. Furthermore, distant supervision suffers from the wrong labeling problem, which makes it even harder to classify long-tail relations. Hence, it is necessary to study how to train RE models with insufficient training instances. Figure 4.10 gives an example of few-shot RE.

FewRel [26] is a new large-scale supervised few-shot RE dataset, which requires models to handle classification with only a handful of training instances, as shown in Table 4.1. Benefiting from the FewRel dataset, there have been some efforts exploring few-shot RE [17, 53, 73] that achieve promising results. Yet, few-shot RE still remains a challenging problem for further research [18].

Table 4.1 An example of a 3-way 2-shot scenario. Different colors indicate different entities, with underlining for the head entity and emphasis (italics) for the tail entity

4.6 Summary

In this chapter, we introduce sentence representation learning. Sentence representation encodes the semantic information of a sentence into a real-valued vector that can be utilized in downstream tasks such as sentence classification or matching. First, we introduce the one-hot representation for sentences and probabilistic language models. Second, we extensively introduce several neural language models, including those based on feedforward, convolutional, and recurrent neural networks, as well as the Transformer. These neural models can learn rich linguistic and semantic knowledge from language modeling. Benefiting from this, pre-trained language models trained on large-scale corpora have achieved state-of-the-art performance on various downstream NLP tasks by transferring the learned semantic knowledge from general corpora to the target tasks. Finally, we introduce several typical applications of sentence representation, including text classification and relation extraction.

For further understanding of sentence representation learning and its applications, there are also some recommended surveys and books including

  • Goldberg, Neural Network Methods for Natural Language Processing [21].

  • Deng & Liu, Deep learning in natural language processing [13].

In the future, for better sentence representation, some directions require further efforts:

(1) Exploring Advanced Architectures. The improvement of model architectures is a key factor in the success of sentence representation. From feedforward neural networks to the Transformer, researchers keep designing neural models better suited to sequential inputs. Based on the Transformer, some researchers are working on new NLP architectures; for instance, Transformer-XL [10] is proposed to overcome the fixed-length context of the Transformer. Since the Transformer is the state-of-the-art NLP architecture, current works mainly adopt attention mechanisms. Beyond these works, is it possible to introduce more human cognitive mechanisms into neural models?

(2) Modeling Long Documents. The representation of long documents is an important extension of sentence representation. New challenges arise when modeling long documents, such as discourse analysis and coreference resolution. Although some existing works already provide document-level NLP tasks (e.g., DocRED [72]), model performance on these tasks is still much lower than human performance. We will introduce advances in document representation learning in the following chapter.

(3) Performing Efficient Representation. Although the combination of the Transformer and large-scale data leads to very powerful sentence representations, these representation models require expensive computation, which limits their application in downstream tasks. Some existing works explore model compression techniques for more efficient models, including knowledge distillation [60], parameter pruning [16], etc. Beyond these works, there remain many unsolved problems in developing better representation models that can efficiently learn from large-scale data and provide effective vectors for downstream tasks.