1.1 Scope of the Book

With the development of efficient Deep Learning models about a decade ago, many Deep Neural Networks have been used to solve pattern recognition tasks such as natural language processing (NLP) and image processing. Typically, the models have to capture the meaning of a text or an image and make an appropriate decision. Alternatively, they can generate new text or images according to the task at hand. An advantage of these models is that they create intermediate features arranged in layers and do not require manually constructed features. Deep Neural Networks such as Convolutional Neural Networks (CNNs) [32] and Recurrent Neural Networks (RNNs) [65] use low-dimensional dense vectors as a kind of distributed representation to express the syntactic and semantic features of language.

All these models can be considered as Artificial Intelligence (AI) Systems. AI is a broad research field aimed at creating intelligent machines that act similarly to humans and animals, which possess natural intelligence. It captures the field’s long-term goal of building machines that mimic and then surpass the full spectrum of human cognition. Machine Learning (ML) is a subfield of artificial intelligence that employs statistical techniques to give machines the capability to ‘learn’ from data without being given explicit instructions on what to do. This process is also called ‘training’, whereby a ‘learning algorithm’ gradually improves the model’s performance on a given task. Deep Learning is an area of ML in which an input is transformed layer by layer in such a way that complex patterns in the data can be recognized. The adjective ‘deep’ refers to the large number of layers in modern ML models that help to learn expressive representations of data and achieve better performance.

In contrast to computer vision, the size of annotated training data for NLP applications was rather small, comprising only a few thousand sentences (except for machine translation). The main reason for this was the high cost of manual annotation. To avoid overfitting, i.e. overadapting models to random fluctuations, only relatively small models could be trained, which did not yield high performance. In the last 5 years, new NLP methods have been developed based on the Transformer introduced by Vaswani et al. [67]. They represent the meaning of each word by a vector of real numbers called an embedding. Between these embeddings, various kinds of “attention” can be computed, which can be considered a sort of “correlation” between different words. In higher layers of the network, attention computations are used to generate new embeddings that can capture subtle nuances in the meaning of words. In particular, they can grasp different meanings of the same word that arise from context. A key advantage of these models is that they can be trained with unannotated text, which is almost infinitely available, and overfitting is not a problem.

Currently, there is a rapid development of new methods in the research field, which makes many approaches from earlier years obsolete. These models are usually trained in two steps: In a first pre-training step, they are trained on a large text corpus containing billions of words without any annotations. A typical pre-training task is to predict single words in the text that have been masked in the input. In this way, the model learns fine subtleties of natural language syntax and semantics. Because enough data is available, the models can be extended to many layers with millions or billions of parameters.

In a second fine-tuning step, the model is trained on a small annotated training set. In this way, the model can be adapted to new specific tasks. Since the fine-tuning data is very small compared to the pre-training data and the model has a high capacity with many millions of parameters, it can be adapted to the fine-tuning task without losing the stored information about the language structure. It was demonstrated that this idea can be applied to most NLP tasks, leading to unprecedented performance gains in semantic understanding. This transfer learning allows knowledge from the pre-training phase to be transferred to the fine-tuned model. These models are referred to as Pre-trained Language Models (PLM).

In recent years, the number of parameters of these PLMs has been systematically increased, together with the amount of training data. It turned out that, contrary to conventional wisdom, the performance of these models kept improving without suffering from overfitting. Models with billions of parameters are able to generate syntactically correct and semantically consistent fluent text if prompted with some starting text. They can answer questions and react meaningfully to different types of prompts.

Moreover, the same PLM architecture can simultaneously be pre-trained with different types of sequences, e.g. tokens in a text, image patches in a picture, sound snippets of speech, image patch sequences in video frames, DNA snippets, etc. They are able to process these media types simultaneously and establish connections between the different modalities. They can be adapted via natural language prompts to perform acceptably on a wide variety of tasks, even though they have not been explicitly trained on these tasks. Because of this flexibility, these models are promising candidates for developing overarching applications. Therefore, large PLMs with billions of parameters are often called Foundation Models [9].

This book is intended to provide an up-to-date overview of the current Pre-trained Language Models and Foundation Models, with a focus on applications in NLP:

  • We describe the necessary background knowledge, model architectures, pre-training and fine-tuning tasks, as well as evaluation metrics.

  • We discuss the most relevant models for each NLP application group that currently have the best accuracy or performance, i.e. are close to the state of the art (Sota). Our purpose here is not to describe a spectrum of all models developed in recent years, but to explain some representative models so that their internal workings can be understood.

  • Recently PLMs have been applied to a number of speech, image and video processing tasks giving rise to the term Foundation Models. We give an overview of the most relevant models, which often allow the joint processing of different media, e.g. text and images.

  • We provide links to available model codes and pre-trained model parameters.

  • We discuss strengths and limitations of the models and give an outlook on possible future developments.

There are a number of previous surveys of Deep Learning and NLP [1,2,3,4, 10, 15, 16, 27, 39, 50, 53, 54, 59, 66]. The surveys of Han et al. [22], Lin et al. [41], and Kalyan et al. [31] are the most up-to-date and comprehensive. Jurafsky and Martin [30] are preparing an up-to-date book on this field. In addition, there are numerous surveys for specific model variants or application areas. Where appropriate, we provide references to these surveys. New terminology is usually printed in italics and models in bold.

The rest of this chapter introduces text preprocessing and classical NLP models, which in part are reused inside PLMs. The second chapter describes the main architectures of Pre-trained Language Models, which are currently the workhorses of NLP. The third chapter considers a large number of PLM variants that extend the capabilities of the basic models. The fourth chapter describes the information captured by PLMs and Foundation Models and analyses their syntactic skills, world knowledge, and reasoning capabilities.

The remainder of the book considers various application domains and identifies PLMs and Foundation Models that currently provide the best results in each domain at a reasonable cost. The fifth chapter reviews information extraction methods that automatically identify structured information and language features in text documents, e.g. for relation extraction. The sixth chapter deals with natural language generation approaches that automatically generate new text in natural language, usually in response to a prompt. The seventh chapter is devoted to models for analyzing and creating multimodal content that typically integrate content understanding and production across two or more modalities, such as text, speech, image, video, etc. The general trend is that more data, computational power, and larger parameter sets lead to better performance. This is explained in the last summary chapter, which also considers social and ethical aspects of Foundation Models and summarizes possible further developments.

1.2 Preprocessing of Text

The first step in preprocessing is to extract the actual text. For each type of text document, e.g. pdf, html, xml, docx, ePUB, there are specific parsers, which resolve the text into characters, words, and formatting information. Usually, the layout and formatting information is removed.

Then, the extracted text is routinely divided into tokens, i.e. words, numbers, and punctuation marks. This process is not trivial, as text usually contains special units like phone numbers or email addresses that must be handled in a special way. Some text mining tasks require the splitting of text into sentences. Tokenizers and sentence splitters for different languages have been developed in the past decades and are available in many programming toolboxes, e.g. Spacy [64].

In the past, many preprocessing methods aimed at generating new relevant features (part-of-speech tags, syntax parse trees) and removing unnecessary tokens (stemming, stop word removal, lemmatization). In most cases, this is no longer necessary with modern approaches that internally automatically derive the features relevant for the task at hand.

In an optional final step, the word-tokens can be further subdivided and rearranged. A simple technique creates character n-grams (i.e. all sequences of n adjacent characters in a word) as additional features. Alternatively, word n-grams can be formed consisting of n consecutive words.

Currently, the most popular approach limits the number of different tokens in the vocabulary. A common choice is byte-pair encoding [19]. This method first selects all characters as tokens. Then, successively the most frequent token pair is merged into a new token and all instances of the token pair are replaced by the new token. This is repeated until a vocabulary of prescribed size is obtained. Note that new words can always be represented by a sequence of vocabulary tokens and characters. Common words end up being a part of the vocabulary, while rarer words are split into components, which often retain some linguistic meaning. In this way, out-of-vocabulary words are avoided.
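The merge loop can be sketched in a few lines of Python. This is a toy version with our own example corpus and vocabulary size (word frequencies are ignored), not the reference implementation of [19]:

```python
from collections import Counter

def byte_pair_encoding(corpus_words, vocab_size):
    """Toy BPE: repeatedly merge the most frequent adjacent token pair."""
    # Represent each word as a list of single-character tokens.
    words = [list(w) for w in corpus_words]
    vocab = {c for w in words for c in w}
    while len(vocab) < vocab_size:
        # Count all adjacent token pairs over the corpus.
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        a, b = pairs.most_common(1)[0][0]        # most frequent pair
        new_token = a + b
        vocab.add(new_token)
        # Replace every occurrence of the pair by the merged token.
        for w in words:
            i = 0
            while i < len(w) - 1:
                if w[i] == a and w[i + 1] == b:
                    w[i:i + 2] = [new_token]
                else:
                    i += 1
    return vocab

print(byte_pair_encoding(["low", "lower", "lowest", "newer", "wider"], 15))
```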

The WordPiece [69] algorithm also starts by selecting all characters of the collection as tokens. Then it assumes that the text corpus has been generated by randomly sampling tokens according to their observed frequencies. It merges tokens a and b (inside words) in such a way that the likelihood of the training data is maximally increased [60]. There is a fast variant whose computational complexity is linear in the input length [63]. SentencePiece [35] is a package containing several subword tokenizers and can also be applied to all Asian languages. All the approaches effectively interpolate between word level inputs for frequent words and character level inputs for infrequent words.

Often the language of the input text has to be determined [29, 57]. Most language identification methods extract character n-grams from the input text and evaluate their relative frequencies. Some methods can be applied to texts containing different languages at the same time [42, 71]. To filter out offensive words from a text, one can use lists of such toxic words in different languages [62].

1.3 Vector Space Models and Document Classification

To apply Machine Learning to documents, their text has to be transformed into scalars, vectors, matrices, or higher-dimensional arrangements of numbers, which are collectively called tensors. In the previous section, text documents in a corpus were converted into a sequence of tokens by preprocessing. These tokens now have to be translated into tensors.

The bag-of-words representation describes a given text document d by a vector x of token counts. The vocabulary is a list of all different tokens contained in the collection of training documents, the training corpus. Ignoring the order of tokens, this bag-of-words vector records how often each token of the vocabulary appears in document d. Note that most vector entries will be zero, as each document will only contain a small fraction of vocabulary tokens. The vector of counts may be modified to emphasize tokens with high information content, e.g. by using the tf-idf statistic [43]. Table 1.1 summarizes different representations for documents used for NLP.

Table 1.1 Representations for documents used in NLP Models.
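As an illustration, bag-of-words and tf-idf representations can be computed with scikit-learn; the toy corpus below is our own example:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ["the cat sat on the mat",
          "the dog chased the cat",
          "dogs and cats are pets"]

# Bag-of-words: one count vector per document over the training vocabulary.
bow = CountVectorizer()
X_counts = bow.fit_transform(corpus)      # sparse matrix, shape (3, |vocabulary|)
print(bow.get_feature_names_out())
print(X_counts.toarray())

# tf-idf re-weights the counts to emphasize informative tokens.
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(corpus)
print(X_tfidf.toarray().round(2))
```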

Document classification methods aim to categorize text documents according to their content [33, 61]. An important example is the logistic classifier, which uses a bag-of-words vector x as input and predicts the probability of each of the k possible output classes y ∈ {1, …, k}. More precisely, there is a random variable Y, which may take the values 1, …, k. To predict the output class y from the input x, a score vector is first generated as

$$\displaystyle \begin{aligned} \boldsymbol{u}=A{\boldsymbol{x}}+\boldsymbol{b} {} \end{aligned} $$
(1.1)

using an affine transformation of the input x. Here, the vector x is transformed by a linear transformation Ax and then a bias vector b is added. The resulting score vector u of length k is then transformed to a probability distribution over the k classes by the softmax function

$$\displaystyle \begin{aligned} \operatorname{\mathrm{softmax}}(u_1,\ldots, u_k) &= \frac{(\exp(u_1),\ldots, \exp(u_k))}{\exp(u_1)+\cdots+ \exp(u_k)} {}, \end{aligned} $$
(1.2)
$$\displaystyle \begin{aligned} p(Y=m|{\boldsymbol{x}};A,\boldsymbol{b}) &= \operatorname{\mathrm{softmax}}(A{\boldsymbol{x}}+\boldsymbol{b}) {}. \end{aligned} $$
(1.3)

Since the softmax function converts any vector into a probability vector, we obtain the conditional probability of output class m as a function of input x. The function

$$\displaystyle \begin{aligned} \text{LRM}({\boldsymbol{x}})=\operatorname{\mathrm{softmax}}(A{\boldsymbol{x}}+\boldsymbol{b}) {} \end{aligned} $$
(1.4)

is called a logistic classifier model [48] with parameter vector w = vec(A, b). In general, a function mapping the input x to the output y or a probability distribution over the output is called a model f(x; w).

The model is trained using training data Tr = {(x[1], y[1]), …, (x[N], y[N])}, whose examples (x[i], y[i]) have to be independent and identically distributed (i.i.d.). The task is to adjust the parameters w such that the predicted probability p(Y = m|x; w) is maximized. Following the Maximum Likelihood principle, this can be achieved by modifying the parameter vector w such that the complete training data has a maximal probability [24, p. 31]

$$\displaystyle \begin{aligned} \max_{\boldsymbol{w}}p(y^{[1]}|{\boldsymbol{x}}^{[1]};{\boldsymbol{w}})*\cdots*p(y^{[N]}|{\boldsymbol{x}}^{[N]};{\boldsymbol{w}}). \end{aligned} $$
(1.5)

Taking the logarithm of this expression and multiplying by −1.0 gives the classification loss function LMC(w), also called the maximum entropy loss.

$$\displaystyle \begin{aligned} L_{\text{MC}}({\boldsymbol{w}})=-\left[\log p(y^{[1]}|{\boldsymbol{x}}^{[1]};{\boldsymbol{w}})+\cdots+\log p(y^{[N]}|{\boldsymbol{x}}^{[N]};{\boldsymbol{w}})\right]. {} \end{aligned} $$
(1.6)

To optimize the model, the gradient of the loss function is computed and the loss is minimized by stochastic gradient descent or another optimizer (cf. Sect. 2.4.1).
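To make Eqs. (1.1)–(1.6) concrete, the following NumPy sketch trains the logistic classifier on random toy data; array names, batch size, and learning rate are our own choices:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, k = 200, 50, 4                       # examples, input size, classes
X = rng.normal(size=(N, d))                # input vectors (here random instead of bag-of-words)
y = rng.integers(0, k, size=N)             # class labels in {0, ..., k-1}

A = np.zeros((k, d))                       # parameters of the affine map (1.1)
b = np.zeros(k)

def softmax(u):                            # (1.2), row-wise and numerically stable
    e = np.exp(u - u.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

lr = 0.1
for step in range(500):                    # stochastic gradient descent
    idx = rng.integers(0, N, size=32)      # random mini-batch
    xb, yb = X[idx], y[idx]
    p = softmax(xb @ A.T + b)              # p(Y=m|x) as in (1.3)
    # Gradient of the loss (1.6) w.r.t. the scores: p minus the one-hot target.
    p[np.arange(len(yb)), yb] -= 1.0
    A -= lr * (p.T @ xb) / len(yb)
    b -= lr * p.mean(axis=0)

pred = softmax(X @ A.T + b).argmax(axis=1)
print("training accuracy:", (pred == y).mean())
```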

The performance of classifiers is measured on separate test data by accuracy, precision, recall, F1-value, etc. [21, p. 410f]. Because the bag-of-words representation ignores important word order information, document classification by a logistic classifier is less commonly used today. However, this model is still a component in most Deep Learning architectures.

1.4 Nonlinear Classifiers

It turns out that the logistic classifier partitions the input space by linear hyperplanes that are not able to solve more complex classification tasks, e.g., the XOR problem [47]. An alternative is to generate an internal hidden vector h by an additional affine transformation A1x + b1 followed by a monotonically non-decreasing nonlinear activation function g and use this hidden vector as input for the logistic classifier to predict the random variable Y

$$\displaystyle \begin{aligned} {\boldsymbol{h}} &= g(A_1{\boldsymbol{x}}+\boldsymbol{b}_1) , {} \end{aligned} $$
(1.7)
$$\displaystyle \begin{aligned} p(Y=m|{\boldsymbol{x}}; {\boldsymbol{w}}) &= \operatorname{\mathrm{softmax}}(A_2{\boldsymbol{h}}+\boldsymbol{b}_2) , {} \end{aligned} $$
(1.8)

where the parameters of this model can be collected in a parameter vector w = vec(A1, b1, A2, b2). The form of the nonlinear activation function g is quite arbitrary, often \(\tanh (x)\) or a rectified linear unit \(\text{ReLU}(x)=\max (0,x)\) is used. FCL(x) = g(A1x + b1) is called a fully connected layer.

This model (Fig. 1.1) is able to solve any classification problem arbitrarily well, provided the length of h is large enough [21, p. 192]. By prepending more fully connected layers to the network we get a Deep Neural Network, which needs fewer parameters than a shallow network to approximate more complex functions. Historically, it has been called a Multilayer Perceptron (MLP). Liang et al. [40] show that for a large class of piecewise smooth functions, the size of the hidden vector needed by a shallow network to approximate a function is exponentially larger than the corresponding number of neurons needed by a deep network for a given degree of function approximation.

Fig. 1.1 A neural network for classification transforms the input by layers with affine transformations and nonlinear activation functions, e.g. ReLU. The final layer usually is a logistic classifier.
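A minimal PyTorch sketch of the model in (1.7)–(1.8); the layer sizes and toy inputs are arbitrary choices of ours:

```python
import torch
from torch import nn

class MLPClassifier(nn.Module):
    """Fully connected layer (1.7) followed by a logistic classifier (1.8)."""
    def __init__(self, d_in, d_hidden, k_classes):
        super().__init__()
        self.hidden = nn.Linear(d_in, d_hidden)    # A1 x + b1
        self.act = nn.ReLU()                       # activation function g
        self.out = nn.Linear(d_hidden, k_classes)  # A2 h + b2

    def forward(self, x):
        h = self.act(self.hidden(x))
        return self.out(h)    # scores; softmax is applied inside the loss

model = MLPClassifier(d_in=50, d_hidden=128, k_classes=4)
x = torch.randn(32, 50)
y = torch.randint(0, 4, (32,))
loss = nn.functional.cross_entropy(model(x), y)    # maximum entropy loss (1.6)
loss.backward()                                    # gradients for stochastic gradient descent
```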

The support vector machine (SVM) [14] follows a different approach and tries to create a hyperplane, which is located between the training examples of the two classes in the input space. In addition, this hyperplane should have a large distance (margin) to the examples. This model reduces overfitting and usually has a high classification accuracy, even if the number of input variables is high, e.g. for document classification [28]. It was extended to different kernel loss criteria, e.g. graph kernels [56], which include grammatical features. Besides SVMs, many alternative classifiers are used, such as random forests [24, p. 588f] and gradient boosted trees [24, p. 360], which are among the most popular classifiers.

For these conventional classifiers the analyst usually has to construct input features manually. Modern classifiers for text analysis are able to create relevant features automatically (Sect. 2.1). For the training of NLP models there exist three main paradigms:

  • Supervised training is based on training data consisting of pairs (x, y) of an input x, e.g. a document text, and an output y, where y usually is a manual annotation, e.g. a sentiment. By optimization the unknown parameters of the model are adapted to predict the output from the input in an optimal way.

  • Unsupervised training just considers some data x and derives some intrinsic knowledge from unlabeled data, such as clusters, densities, or latent representations.

  • Self-supervised training selects parts of the observed data vector as input x and output y. The key idea is to predict y from x in a supervised manner. For example, the language model is a self-supervised task that attempts to predict the next token vt+1 from the previous tokens v1, …, vt, as sketched below. For NLP models, this type of training is used very often.
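The following short snippet illustrates how such self-supervised training pairs for next-token prediction can be derived from a raw token sequence (a toy example of ours):

```python
tokens = "the cat sat on the mat".split()

# Self-supervised pairs: input x = previous tokens, output y = next token.
pairs = [(tokens[:t], tokens[t]) for t in range(1, len(tokens))]
for x, y in pairs:
    print(x, "->", y)
# ['the'] -> 'cat', ['the', 'cat'] -> 'sat', ...
```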

1.5 Generating Static Word Embeddings

One problem with bag-of-words representations is that frequency vectors of tokens are unable to capture relationships between words, such as synonymy and homonymy, and give no indication of their semantic similarity. An alternative is provided by more expressive representations of words and documents based on the idea of distributional semantics [58], popularized by Zellig Harris [23] and John Firth [18]. According to Firth, “a word is characterized by the company it keeps”. This states that words occurring in the same neighborhood tend to have similar meanings.

Based on this idea, each word can be characterized by a demb-dimensional vector, a word embedding. Usually, a value between 100 and 1000 is chosen for demb. These embeddings have to be created such that words that occur in similar contexts have embeddings with a small vector distance, such as the Euclidean distance. A document then can be represented by a sequence of such embeddings. It turns out that words usually have a similar meaning if their embeddings have a low distance. Embeddings can be used as input for downstream text mining tasks, e.g. sentiment analysis. Goldberg [20] gives an excellent introduction to static word embeddings. The embeddings are called static embeddings as each word has a single embedding independent of the context.

There are a number of different approaches to generate word embeddings in an unsupervised way. Collobert et al. [13] show that word embeddings obtained by predicting neighbor words can be used to improve the performance of downstream tasks such as named entity recognition and semantic role labeling.

Word2vec [45] predicts the words in the neighborhood of a central word with an extremely simple model. As shown in Fig. 1.2 it uses the embedding vector of the central word as input for a logistic classifier (1.3) to infer the probabilities of words in the neighborhood of about five to seven positions. The training target is to forecast all neighboring words in the training set with a high probability. For training, Word2Vec repeats this prediction for all words of a corpus, and the parameters of the logistic classifier as well as the values of the embeddings are optimized by stochastic gradient descent to improve the prediction of neighboring words.

Fig. 1.2 Word2Vec predicts the words in the neighborhood of a central word by a logistic classifier L. The input to L is the embedding of the central word. By training with a large set of documents, the parameters of L as well as the embeddings are learned [54, p. 2].
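A Word2Vec model of this kind can be trained, for instance, with the gensim library; the tiny corpus and the hyperparameters below are illustrative only:

```python
from gensim.models import Word2Vec

sentences = [["biden", "has", "been", "us", "president", "since", "2021"],
             ["the", "president", "lives", "in", "the", "white", "house"]]

# Skip-gram (sg=1): predict neighboring words from the central word,
# using negative sampling instead of the full softmax.
model = Word2Vec(sentences, vector_size=100, window=5,
                 min_count=1, sg=1, negative=10, epochs=50)

print(model.wv["president"][:5])               # first entries of the embedding vector
print(model.wv.most_similar("president", topn=3))
```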

The vocabulary of a text collection contains k different words, e.g. k = 100,000. To predict the probability of the i-th word by softmax (1.2), k exponential terms \(\exp (u_i)\) have to be computed. To avoid this effort, the fraction is approximated as

$$\displaystyle \begin{aligned} \frac{\exp(u_i)}{\exp(u_1)+\cdots+\exp(u_k)} \approx \frac{\exp(u_i)}{\exp(u_i)+\sum_{j\in S} \exp(u_j)}, {} \end{aligned} $$
(1.9)

where S is a small sample of, say, 10 randomly selected indices of words. This technique is called noise contrastive estimation [21, p. 612]. There are several variants available, which are used for almost all classification tasks involving softmax computations with many classes. Since stochastic gradient descent works with noisy gradients, the additional noise introduced by the approximation of the softmax function is not harmful and can even help the model escape local minima. The shallow architecture of Word2Vec proved to be far more efficient than previous architectures for representation learning.
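A small numerical illustration of the approximation (1.9), with arbitrary scores and a sample of ten indices:

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.normal(size=100_000)      # scores u_1, ..., u_k for a large vocabulary
i = 42                            # index of the word whose probability is needed

exact = np.exp(u[i]) / np.exp(u).sum()

S = rng.choice(len(u), size=10, replace=False)   # small random sample of indices
approx = np.exp(u[i]) / (np.exp(u[i]) + np.exp(u[S]).sum())

# The sampled estimate is noisy and biased, but it avoids summing over all
# 100,000 terms; as discussed above, this noise is tolerable during training.
print(exact, approx)
```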

Word2Vec embeddings have been used for many downstream tasks, e.g. document classification. In addition, words with a similar meaning may be detected by simply searching for words whose embeddings have a small Euclidean distance to the embedding of a target word. The closest neighbors of “neutron”, for example, are “neutrons”, “protons”, “deuterium”, “positron”, and “decay”. In this way, synonyms can be revealed. Projections of embeddings on two dimensions may be used for the exploratory analysis of the content of a corpus. GloVe generates similar embedding vectors using aggregated global word-word co-occurrence statistics from a corpus [51].

It turns out that differences between the embeddings often have an interpretation. For example, the result of emb(Germany) − emb(Berlin) + emb(Paris) has emb(France) as its nearest neighbor with respect to Euclidean distance. This property is called analogy and holds for a majority of examples of many relations such as capital-country, currency-country, etc. [45].
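The analogy property can be checked with pretrained embeddings, for example via gensim's downloader. The chosen vector set is just one example (it is downloaded on first use), and most_similar ranks neighbors by cosine similarity rather than Euclidean distance:

```python
import gensim.downloader as api

# Load a set of pretrained embeddings (one of several options in gensim-data).
vectors = api.load("glove-wiki-gigaword-100")

# emb(germany) - emb(berlin) + emb(paris) should be close to emb(france).
print(vectors.most_similar(positive=["germany", "paris"],
                           negative=["berlin"], topn=3))
```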

FastText [8] representations enrich static word embeddings by using subword information. Character n-grams of a given length range, e.g., 3–6, are extracted from each word. Then, embedding vectors are defined for the words as well as their character n-grams. To train the embeddings all word and character n-gram embeddings in the neighborhood of a central word are averaged, and the probabilities of the central word and its character n-grams are predicted by a logistic classifier. To improve the probability prediction, the parameters of the model are optimized by stochastic gradient descent. This is repeated for all words in a training corpus. After training, unseen words can be reconstructed using only their n-gram embeddings. Starspace [68] was introduced as a generalization of FastText. It allows embedding arbitrary entities (such as authors or products) by analyzing texts related to them and evaluating graph structures. An alternative is spherical embeddings, where unsupervised word and paragraph embeddings are constrained to a hypersphere [44].

1.6 Recurrent Neural Networks

Recurrent Neural Networks were developed to model sequences v1, …, vT of varying length T, for example the tokens of a text document. Consider the task of predicting the next token vt+1 given the previous tokens (v1, …, vt). As proposed by Bengio et al. [6], each token vt is represented by an embedding vector xt = emb(vt) indicating the meaning of vt. The previous tokens are characterized by a hidden vector ht, which describes the state of the subsequence (v1, …, vt−1). The RNN is a function RNN(ht, xt) predicting the next hidden vector ht+1 by

$$\displaystyle \begin{aligned} {\boldsymbol{h}}_{t+1}=\text{RNN}({\boldsymbol{h}}_t , {\boldsymbol{x}}_t). \end{aligned} $$
(1.10)

Subsequently, a logistic classifier (1.3) with parameters H and g predicts a probability vector for the next token vt+1 using the information contained in ht+1,

$$\displaystyle \begin{aligned} p(V_{t+1}|v_1,\ldots,v_t)=\operatorname{\mathrm{softmax}}(H*{\boldsymbol{h}}_{t+1}+\boldsymbol{g}), \end{aligned} $$
(1.11)

as shown in Fig. 1.3. Here Vt is the random variable of possible tokens at position t. According to the definition of conditional probability, the joint probability of the whole sequence can be factorized as

$$\displaystyle \begin{aligned} p(v_1,\ldots,v_T) = p(V_T=v_T|v_1,\ldots,v_{T-1})* \cdots* p(V_2=v_2|v_1)*p(V_1=v_1). {} \end{aligned} $$
(1.12)

A model that computes either the joint probability or the conditional probability of natural language texts is called a language model, as it potentially covers all information about the language. A language model sequentially predicting the next word by the conditional probability is often referred to as an autoregressive language model. According to (1.12), the observed tokens (v1, …, vt) can be used as input to predict the probability of the next token Vt+1. The product of these probabilities yields the correct joint probability of the observed token sequence (v1, …, vT). The same model RNN(h, x) is repeatedly applied and generates a sequence of hidden vectors ht. A simple RNN just consists of a single fully connected layer

$$\displaystyle \begin{aligned} \text{RNN}({\boldsymbol{h}}_t , {\boldsymbol{x}}_t) = \tanh \left(A*\begin{bmatrix} {\boldsymbol{h}}_t\\ {\boldsymbol{x}}_t\end{bmatrix}+{\boldsymbol{b}}\right). \end{aligned} $$
(1.13)

The probabilities of the predicted words v1, …, vT depend on the parameters w = vec(H, g, A, b, emb(v1), …, emb(vT)). To improve these probabilities, we may use the stochastic gradient descent optimizer (Sect. 2.4.1) and adapt the unknown parameters in w. Note that this also includes the estimation of new token embeddings emb(vt). A recent overview is given in [70, Ch. 8–9].

Fig. 1.3 The RNN starts on the left side and successively predicts the probability of the next token, with the previous tokens as conditions, using a logistic classifier L. The hidden vector ht stores information about the tokens that occur before position t.
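A compact PyTorch sketch of the language model defined by (1.10)–(1.13); vocabulary size, dimensions, and variable names are our own choices:

```python
import torch
from torch import nn

class RNNLanguageModel(nn.Module):
    def __init__(self, vocab_size, d_emb=128, d_hidden=256):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d_emb)             # x_t = emb(v_t)
        self.rnn = nn.RNN(d_emb, d_hidden, batch_first=True)   # h_{t+1} = RNN(h_t, x_t), (1.13)
        self.out = nn.Linear(d_hidden, vocab_size)             # logistic classifier H h + g, (1.11)

    def forward(self, tokens):            # tokens: (batch, T) of token ids
        x = self.emb(tokens)
        h, _ = self.rnn(x)                # hidden vectors h_1, ..., h_T
        return self.out(h)                # scores for the next token at each position

model = RNNLanguageModel(vocab_size=10_000)
tokens = torch.randint(0, 10_000, (8, 20))
logits = model(tokens[:, :-1])            # predict v_{t+1} from v_1, ..., v_t
loss = nn.functional.cross_entropy(logits.reshape(-1, 10_000),
                                   tokens[:, 1:].reshape(-1))
loss.backward()   # gradients also flow into the token embeddings emb(v_t)
```

Replacing nn.RNN by nn.LSTM or nn.GRU yields the gated variants discussed below.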

It turns out that this model has difficulties reconstructing relations between distant sequence elements, since gradients tend to vanish or “explode” as the sequences get longer. Therefore, new RNN types have been developed, e.g. the Long Short-Term Memory (LSTM) [26] and the Gated Recurrent Unit (GRU) [11], which capture long-range dependencies in the sequence much better.

Besides predicting the next word in a sequence, RNNs have been successfully applied to predict properties of sequence elements, e.g. named entity recognition [36] and relation extraction [38]. For these applications bidirectional RNNs have been developed, consisting of a forward and a backward language model. The forward language model starts at the beginning of a text and predicts the next token, while the backward language model starts at the end of a text and predicts the previous token. Bidirectional LSTMs are also called biLSTMs. In addition, multilayer RNNs were proposed [72], where the hidden vector generated by the RNN-cell in one layer is used as the input to the RNN-cell in the next layer, and the last layer provides the prediction of the current task.

Machine translation from one language to another is an important application of RNNs [5]. In this process, an input sentence is first encoded by an encoder RNN as a hidden vector hT. This hidden vector is in turn used by a second decoder RNN as an initial hidden vector to generate the words of the target language sentence. However, RNNs still have difficulties capturing relationships over long distances, because they do not model direct relations between distant sequence elements.

Attention was first used in the context of machine translation to communicate information over long distances. It computes the correlation between hidden vectors of the decoder RNN and hidden vectors of the encoder RNN at different positions. This correlation is used to build a context vector as a weighted average of relevant encoder hidden vectors. Then, this context vector is exploited to improve the final translation result [5]. The resulting translations were much better than those with the original RNN. We will see in later sections that attention is a fundamental principle for constructing better NLP models.
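The construction of the context vector can be sketched as follows, here with simple dot-product scores; Bahdanau et al. [5] compute the scores with a small additional network:

```python
import torch

d = 64
enc = torch.randn(12, d)     # encoder hidden vectors, one per source position
dec = torch.randn(d)         # current decoder hidden vector

scores = enc @ dec                        # similarity of decoder state to each encoder state
weights = torch.softmax(scores, dim=0)    # attention weights, summing to 1
context = weights @ enc                   # weighted average of encoder hidden vectors

print(weights.shape, context.shape)       # torch.Size([12]) torch.Size([64])
```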

ELMo [52] generates embeddings with bidirectional LSTM language models in several layers. The model is pre-trained as forward and backward language model with a large non-annotated text corpus. During fine-tuning, averages of the hidden vectors are used to predict the properties of words based on an annotated training set. These language models take into account the words before and after a position, and thus employ contextual representations for the word in the central position. For a variety of tasks such as sentiment analysis, question answering, and textual entailment, ELMo was able to improve Sota performance.

1.7 Convolutional Neural Networks

Convolutional Neural Networks (CNNs) [37] are widely known for their success in the image domain. They start with a small quadratic arrangement of parameters called a filter kernel, which is moved over the input pixel matrix of the image. The values of the filter kernel are multiplied with the underlying pixel values and summed to generate an output value. This is repeated for every position of the input pixel matrix. During training the parameters of a filter kernel are automatically tuned such that they can detect local image patterns such as blobs or lines. Each layer of the network, which is also called a convolution layer, consists of many filter kernels, and a network contains a number of convolution layers. Interspersed max pooling layers perform a local aggregation of pixels by taking the maximum. The final layer of a Convolutional Neural Network usually is a fully connected layer with a softmax classifier.

Their breakthrough was AlexNet [34], which receives the RGB pixel matrix of an image as input and is tasked with assigning a content class to the image. This model won the 2012 ImageNet competition, where images had to be assigned to one of 1000 classes, and demonstrated the superior performance of Deep Neural Networks. Even earlier, the deep CNN of Cireşan et al. [12] achieved Sota performance on a number of image classification benchmarks. A highly successful CNN is ResNet [25], which employs so-called residual connections working as bypasses. These can circumvent many layers at the beginning of training and are the key to training neural networks with many hundreds of layers. ResNet resulted in image classifiers which have a higher accuracy than humans.

While Recurrent Neural Networks were regarded as the best way to process sequential input such as text, some CNN-based architectures were introduced which achieved high performance on some NLP tasks. Kim [32] proposed a rather shallow CNN for sentence classification. It contains an embedding layer, a convolutional layer, a max-pooling layer, and a fully connected layer with softmax output. 1-D convolutions were applied to the embeddings of the input words, basically combining the information stored in adjacent words, treating them as n-grams. The embeddings are processed by a moving average with trainable weights. Using this architecture for classification proved to be very efficient, yielding a performance similar to recurrent architectures that are more difficult to train.
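A condensed PyTorch sketch in the spirit of this architecture, using a single filter width (the original combines several widths in parallel); all sizes are our own choices:

```python
import torch
from torch import nn

class TextCNN(nn.Module):
    def __init__(self, vocab_size, k_classes, d_emb=128, n_filters=100, width=3):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d_emb)
        # 1-D convolution over the token dimension, i.e. over word n-grams.
        self.conv = nn.Conv1d(d_emb, n_filters, kernel_size=width)
        self.fc = nn.Linear(n_filters, k_classes)

    def forward(self, tokens):                 # tokens: (batch, T)
        x = self.emb(tokens).transpose(1, 2)   # (batch, d_emb, T) as expected by Conv1d
        h = torch.relu(self.conv(x))           # (batch, n_filters, T - width + 1)
        h = h.max(dim=2).values                # max pooling over positions
        return self.fc(h)                      # class scores for the softmax classifier

model = TextCNN(vocab_size=10_000, k_classes=2)
logits = model(torch.randint(0, 10_000, (4, 40)))
print(logits.shape)                            # torch.Size([4, 2])
```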

Another interesting CNN architecture is wavenet [49], a deeper network used mainly for text-to-speech synthesis. It consists of multiple convolutional layers stacked on top of each other, with its main ingredient being dilated causal convolutions. Causal means that the convolutions at position t can only utilize prior information x1, …, xt−1. Dilated means that the convolutions can skip input values with a certain step size k, i.e. that in some layer the features at position t are predicted using information from positions t, t − k, t − 2k, …. This step size k is doubled in each successive layer, yielding dilations of size k0, k1, k2, …. In this way, very long time spans can be included in the prediction. This model architecture has been shown to give very good results for text-to-speech synthesis.
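Dilated causal convolutions can be emulated with standard 1-D convolutions by padding only on the left and doubling the dilation in each layer; the following sketch (with our own sizes) illustrates the idea and is not the actual WaveNet implementation:

```python
import torch
from torch import nn

d, T = 16, 32
x = torch.randn(1, d, T)                       # (batch, channels, time)

layers = []
for layer in range(4):
    dilation = 2 ** layer                      # dilation doubled in each layer
    conv = nn.Conv1d(d, d, kernel_size=2, dilation=dilation)
    pad = nn.ConstantPad1d((dilation, 0), 0.)  # pad only on the left => causal
    layers += [pad, conv, nn.ReLU()]

y = nn.Sequential(*layers)(x)
print(y.shape)   # torch.Size([1, 16, 32]): position t only sees positions <= t
```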

1.8 Summary

Classical NLP has a long history, and machine learning models have been used in the field for several decades. They all require some preprocessing steps to generate words or tokens from the input text. Tokens are particularly valuable because they form a dictionary of finite size and allow arbitrary words to be represented by combination. Therefore, they are used by most PLMs. Early document representations like bag-of-words are now obsolete because they ignore sequence information. Nevertheless, classifiers based on them, like logistic classifiers and fully connected layers, are important building blocks of PLMs.

The concept of static word embeddings initiated the revolution in NLP, which is based on contextual word embeddings. These ideas are elaborated in the next chapter. Recurrent neural networks have been used to implement the first successful language models, but were completely superseded by attention-based models. Convolutional neural networks for image processing are still employed in many applications. PLMs today often have a similar performance on image data, and sometimes CNNs are combined with PLMs to exploit their respective strengths, as discussed in Chap. 7.