
In this chapter we describe Foundation Models, i.e. large Pre-trained Language Models for generating new text in different application areas.

  • Document Retrieval systems accept a query and return an ordered list of text documents from a document collection, often evaluating the similarity of embeddings to retrieve relevant text passages (Sect. 6.1).

  • Question Answering systems are given a natural language question and must provide an answer, usually in natural language (Sect. 6.2).

  • Machine Translation takes a text in one language and generates a translation into another language (Sect. 6.3).

  • Text Summarization receives a long document and has to write a short summary covering the most important contents of the document (Sect. 6.4).

  • Text Generation uses an autoregressive Language Model to generate a longer story, usually starting from an initial text input (Sect. 6.5).

  • Dialog systems have the task of conducting a dialog with a human partner, typically not limited to a specific topic (Sect. 6.6).

Due to the large number of different approaches, we focus on representative models which exhibit a high performance at the time of writing. We review the current best techniques for each area, measured against appropriate benchmarks and taking into account the computational resources required. For standard models a link to the description in earlier chapters is provided. Examples for each application area are shown in Table 6.1.

Table 6.1 Language generation tasks illustrated by an example

6.1 Document Retrieval

Information retrieval (IR) uses computer systems to search databases for content. The resulting IR system is often called a search engine. Often, the user formulates a sentence or a query about some topic, and the system is expected to return a sorted list of documents relevant to the query (ad hoc retrieval). Here we focus on retrieving textual information from a stored collection of documents. In contrast to the question answering approaches in Sect. 6.2, the system does not generate a direct answer to the query in natural language.

Earlier IR systems were keyword-based: all words contained in a document were stored in an inverted index. The retrieval algorithm searched the index to identify documents that contained the query words. Then, these documents were ranked according to the information content of each query word found in a document, e.g. measured by tf-idf or BM25 [186]; a minimal BM25 scoring sketch is given after the following list. These two steps are shown in Fig. 6.1. A survey of earlier retrieval techniques is given by Abbasiyantaeb and Momtazi [2]. However, this approach had three major problems:

  • Many objects, activities, or events may be expressed by different words, called synonyms, e.g. “drink” and “beverage” or “buy” and “purchase”. Documents containing only such alternative words are not returned by keyword retrieval. Paraphrases like “he has tons of stuff to throw away” and “he needs to get rid of a lot of junk” are even harder to spot and are ignored. This is called the vocabulary mismatch problem.

    Fig. 6.1

    Retrieve-and-rerank architecture using PLMs. First, texts are retrieved from the document collection, usually with exact-match bag-of-words queries. These candidates are then reranked using PLM embeddings, e.g. from BERT. Image adapted from [123], reprinted with kind permission of authors

  • Many words have different meanings depending on the context (e.g. “rock”: music or stone). These words are called homonyms. Some of the retrieved documents containing such a word will therefore be mismatches.

  • The order of words is often crucial for the meaning of the sentences (e.g. “dog kills person” vs. “person kills dog”). This is usually ignored with keyword search.
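
The classical keyword pipeline of Fig. 6.1 can be made concrete with a small sketch that builds a toy inverted index and ranks documents with the BM25 formula. The corpus, the parameter values k1 = 1.5 and b = 0.75, and all function names are illustrative assumptions, not the implementation of any particular search engine.

```python
import math
from collections import Counter, defaultdict

# Toy corpus; in a real system the inverted index is built offline.
docs = {
    "d1": "hops are added during the boiling process of brewing".split(),
    "d2": "beer brewing uses malt water yeast and hops".split(),
    "d3": "the stock market fell sharply today".split(),
}

N = len(docs)
avgdl = sum(len(tokens) for tokens in docs.values()) / N
df = Counter(term for tokens in docs.values() for term in set(tokens))
index = defaultdict(list)                      # inverted index: term -> [(doc_id, tf)]
for doc_id, tokens in docs.items():
    for term, tf in Counter(tokens).items():
        index[term].append((doc_id, tf))

def bm25_scores(query, k1=1.5, b=0.75):
    """Return documents ranked by the sum of BM25 scores of the query terms."""
    scores = defaultdict(float)
    for term in query.split():
        if term not in df:
            continue                           # vocabulary mismatch: unseen terms never match
        idf = math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))
        for doc_id, tf in index[term]:
            dl = len(docs[doc_id])
            scores[doc_id] += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))
    return sorted(scores.items(), key=lambda item: -item[1])

print(bm25_scores("when are hops added to brewing"))
```

Note that a query using only synonyms, such as “beverage fermentation”, would receive no score at all, which illustrates the vocabulary mismatch problem discussed above.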

As an alternative, contextual embeddings were used to represent queries and documents. By identifying matching documents through comparison of contextual semantic representations, word meaning differences between documents and queries can be reduced and texts with synonyms, homonyms, and paraphrases can be retrieved. These models have achieved Sota results on various retrieval benchmarks [137] and have recently been introduced in commercial search engines. They are therefore one of the most commercially important applications of PLMs to date.

6.1.1 Dense Retrieval

Dense retrieval methods encode text as an embedding vector with a fixed length much smaller than the text length. Whether a document is relevant to a given query is determined by the similarity of embedding vectors, which is computed by cosine similarity or inner products. Unlike question answering (Sect. 6.2), these models do not generate a direct natural language response to a search query, but return complete documents or text passages. Recently, dense retrieval methods based on PLMs outperformed their keyword counterparts when fine-tuned on a small set of in-domain relevance-labeled documents. Lin et al. [124] provide a comprehensive overview of retrieval systems with PLMs. Different approaches for dense retrieval can be distinguished and are covered in the next sections:

  • Cross-Encoder: Use the concatenated query and a document as input to BERT and determine the relevance of the document for the query (Sect. 6.1.3).

  • Retrieval with token embeddings: The tokens of the query and the document are encoded by contextual embeddings. Then different metrics are used to compare these embeddings and to collect relevant documents (Sect. 6.1.4).

  • Retrieval with passage embeddings: These techniques encode the query and passages of the document by an embedding. Subsequently, these embeddings are compared. This type of embedding respects word order and thus has the potential to return better matches (Sect. 6.1.5).

Only a small selection of methods can be described here; they should give an impression of the approaches currently in use, as summarized in Table 6.2. In Sects. 6.2.2 and 6.2.3 retrieval techniques for question answering are discussed, which are even more powerful. A very comprehensive survey on PLMs for retrieval is provided by Lin et al. [124].

Table 6.2 Document retrieval models with their performance. Benchmarks (Sect. 6.1.2): MARCO: MS-MARCO [16], NQuest: Natural Questions benchmark [109], Wiki65K: long Wikipedia documents [247]

6.1.2 Measuring Text Retrieval Performance

There are a number of benchmark datasets used for training and comparing retrieval approaches. The MS-MARCO benchmark [16] is a large-scale collection created from about half a million anonymized questions sampled from Bing’s search query logs. For the passage ranking task it contains a corpus of 8.8M passages with an average length of 55 words extracted from 3.6M web documents. The goal is to retrieve passages that answer the question. The training set contains approximately 500k pairs of queries and relevant documents, and another 400M pairs of queries and non-relevant documents. There is a development set and a secret test set with about 7k queries each. However, there is a discussion that the gold annotation of the MS-MARCO benchmark is biased to some extent [10].

The Natural Questions (NQ) benchmark [109] contains questions with at least 8 words posed by real users to the Google search engine. It requires QA systems to read and comprehend an entire Wikipedia article, which may or may not contain the answer to the question. An example is the question “Where is blood pumped after it leaves the right ventricle?” The task is to retrieve a long answer, i.e. a paragraph from the page that answers the question, e.g. “From the right ventricle, blood is pumped through the semilunar pulmonary valve …”, or an indication that there is no answer. The task was designed to be close to an end-to-end question answering application. One to five answers are provided by human annotators. While the original Natural Questions benchmark was a reading comprehension task providing a number of evidence documents for each question, the EfficientQA benchmark [147] adapted this to open-domain QA by taking examples with answers of up to five tokens and discarding the evidence documents.

Min et al. [146] note that over half of the queries in Natural Questions are ambiguous, with many sources of ambiguity such as event and entity references. They develop the AmbigQA dataset with reformulated questions that yield a unique answer.

A simple evaluation measure is the top-k accuracy, the proportion of queries for which one of the k most likely answers returned is correct. A more refined measure is the mean reciprocal rank (MRR), based on the reciprocal of the rank of the first correct answer, which is set to 0 if no correct answer was returned. If, for instance, the third answer is correct, the reciprocal rank is 1∕3. The MRR for |Q| queries is

$$\displaystyle \begin{aligned} MRR = \frac 1{|Q|}\sum_{i=1}^{|Q|}\frac 1{rank_i}. \end{aligned} $$
(6.1)

MRR@m indicates that an ordered list of m documents is always returned.

We may define Pr(i) as the precision reached by the first i elements of the list of size m, i.e. the fraction of relevant documents among the first i. Then we may define the average precision as

$$\displaystyle \begin{aligned} AP = \frac 1m \sum_{i=1}^m Pr(i) * rel(i) \qquad MAP = \frac 1{|Q|}\sum_{j=1}^{|Q|} AP_j \end{aligned} $$
(6.2)

where rel(i) = 1 if the i-th document is relevant and 0 otherwise. The mean average precision (MAP) is the average of AP over |Q| different queries.
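
A minimal sketch of these measures for binary relevance judgments is given below; the toy relevance lists are illustrative. Note that Eq. (6.2) normalizes AP by the list size m, whereas many IR toolkits normalize by the number of relevant documents instead.

```python
def reciprocal_rank(rels):
    """rels: 0/1 relevance flags of the returned list, in rank order."""
    for i, rel in enumerate(rels, start=1):
        if rel:
            return 1.0 / i                 # reciprocal rank of the first relevant document
    return 0.0                             # no correct answer returned

def mrr(all_rels):
    """Eq. (6.1): mean reciprocal rank over |Q| queries."""
    return sum(reciprocal_rank(r) for r in all_rels) / len(all_rels)

def average_precision(rels):
    """Eq. (6.2): average precision over a returned list of size m."""
    m, hits, ap = len(rels), 0, 0.0
    for i, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            ap += hits / i                 # Pr(i): precision among the first i results
    return ap / m

def mean_average_precision(all_rels):
    return sum(average_precision(r) for r in all_rels) / len(all_rels)

# Two queries with ranked 0/1 relevance judgments for the top-5 results
queries = [[0, 0, 1, 0, 1], [1, 0, 0, 0, 0]]
print(mrr(queries))                        # (1/3 + 1) / 2 = 0.667
print(mean_average_precision(queries))
```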

6.1.3 Cross-Encoders with BERT

monoBERT [155] performs reranking with a BERT classifier fine-tuned on the embedding of the [CLS] token. Query and document are combined into the input “[CLS] <query> [SEP] <document> [SEP]”. This is processed by a BERT fine-tuned on MS-MARCO, where the embedding of [CLS] in the last layer is used by a logistic classifier to predict the probability that the current document is relevant for the query. This output score is used for ranking (Fig. 6.2). Note that by this technique paraphrases like “symptoms of influenza include fever and nasal congestion” and “a stuffy nose and elevated temperature are signs you may have the flu” may be identified.

Fig. 6.2

The monoBERT model uses a fine-tuned BERT model for ranking passages with respect to queries. The input contains the query concatenated with the passage. The [CLS] token embedding is trained to return the probability that the passage answers the query
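
In practice, such cross-encoder scoring can be reproduced with a few lines of code. The sketch below uses a small, publicly available MiniLM cross-encoder trained on MS-MARCO as a stand-in for the original monoBERT checkpoint; the model name and example texts are assumptions for illustration.

```python
from sentence_transformers import CrossEncoder

# A small cross-encoder trained on MS-MARCO passage ranking; internally the
# query and passage are concatenated as "[CLS] query [SEP] passage [SEP]".
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "what are the symptoms of influenza"
passages = [
    "A stuffy nose and elevated temperature are signs you may have the flu.",
    "The capital of France is Paris.",
]

# One forward pass per (query, passage) pair; the score is used for reranking.
scores = model.predict([(query, p) for p in passages])
for passage, score in sorted(zip(passages, scores), key=lambda x: -x[1]):
    print(f"{score:8.3f}  {passage}")
```

This also makes clear why reranking is expensive: every candidate passage requires its own forward pass through the model.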

On the MS-MARCO benchmark [153] monoBERT yields an MRR@10 value of 35.9% (i.e. the first relevant document at position 2.8 on average). Since the previous keyword-based BM25 search had an MRR@10 value of 16.5% (first relevant document at position 6.1 on average), this result was a dramatic increase in the performance of search engines. Such a big jump in effectiveness caused by an individual model is rarely observed in either academia or industry, which led to immediate excitement in the community.

It is quite striking how monoBERT provides a simple yet effective solution to the problem of text ranking (at least for texts that are shorter than its maximal input length) [124]. In several studies monoBERT has been found to be better than BM25 in estimating relevance when term frequency is held constant. Textual manipulation tests that alter existing documents show that rearranging the order of words within a sentence or across sentences has a large negative effect, while shuffling the order of sentences within a document has a modest negative effect. In contrast, rearranging only prepositions has little effect. Experimental results from input template variations show that monoBERT uses exact match, “soft” semantic matches, and information about the position of words. Exactly how these different components are combined, for different types of queries, across different corpora, and under different settings, remains an open question. Note that this search approach requires enormous computational resources, as a new forward pass has to be performed for each candidate passage, while the effort for index search grows only logarithmically.

monoT5 [154] used the T5 encoder-decoder model instead of BERT to rerank retrieved documents. The model receives the input “Query: <query> Document: <document> Relevant:”. monoT5 is fine-tuned to produce the token true or false depending on whether the document is relevant to the query. The predicted probability of true can be used as a relevance score. For T5 with 3B parameters the authors get an MRR@10 value of 38% for MS-MARCO passage retrieval. This shows that larger models increase the performance of retrieval systems.
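
The monoT5 scoring scheme can be sketched as follows: the probability of the token “true” at the first decoder step serves as the relevance score. The checkpoint name is an assumption (a publicly released base-sized reranker rather than the 3B model reported above), and the prompt format follows the description in the text.

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

name = "castorini/monot5-base-msmarco"        # assumed public monoT5 checkpoint
tok = T5Tokenizer.from_pretrained(name)
model = T5ForConditionalGeneration.from_pretrained(name).eval()

true_id = tok("true").input_ids[0]            # id of the sentencepiece token for "true"
false_id = tok("false").input_ids[0]

def relevance_score(query: str, document: str) -> float:
    """Probability that the model emits 'true' as its first output token."""
    text = f"Query: {query} Document: {document} Relevant:"
    enc = tok(text, return_tensors="pt", truncation=True, max_length=512)
    start = torch.tensor([[model.config.decoder_start_token_id]])
    with torch.no_grad():
        logits = model(**enc, decoder_input_ids=start).logits[0, 0]
    probs = torch.softmax(logits[[false_id, true_id]], dim=0)
    return probs[1].item()

print(relevance_score("when are hops added to brewing?",
                      "Hops are added during the boiling step of the brewing process."))
```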

6.1.4 Using Token Embeddings for Retrieval

The all-to-all nature of the BERT attention patterns at each transformer encoder layer means that there is a quadratic complexity in terms of time and space with respect to the input length. In Sect. 3.2 we have introduced a number of approaches to cope with longer inputs. These all can be used to process longer documents. Among the many approaches we discuss ColBERT and Model 1 in more detail.

ColBERT [99] either reranks the output of another (cheaper) retrieval model, typically a term-based model, or is used directly for end-to-end retrieval from a document collection. Queries and documents are prepended with different special tokens. ColBERT uses a single pre-trained BERT model to encode each query or document into a bag of token embeddings. In a final layer the size of the embeddings is reduced and they are normalized to Euclidean length 1.0. Hence, the inner product is equivalent to the cosine similarity. If (q1, …, qm) are the query tokens and di,1, …, di,k are the tokens of the i-th document, the similarity of q and di is computed as

$$\displaystyle \begin{aligned} s_{q,d_i} = \sum_{r=1}^m \max_j {\boldsymbol{\eta}}(q_r)^\intercal{\boldsymbol{\eta}}(d_{i,j}). \end{aligned} $$
(6.3)

This is the sum of maximum cosine similarities (MaxSim) between each query token and the “best” matching token contained in the document di. For end-to-end retrieval, the 10 nearest document token embeddings (in L2 distance) are taken into account for each query embedding, and the 1000 closest candidate documents are retrieved.
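
The MaxSim operator of Eq. (6.3) reduces to one matrix product per document. The sketch below uses random, length-normalized token embeddings in place of real ColBERT outputs; the embedding size of 128 follows the ColBERT paper, everything else is illustrative.

```python
import torch
import torch.nn.functional as F

def maxsim_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """Late-interaction score of Eq. (6.3).

    query_emb: (m, d) normalized token embeddings of the query
    doc_emb:   (k, d) normalized token embeddings of one document
    """
    sim = query_emb @ doc_emb.T          # (m, k) cosine similarities
    return sim.max(dim=1).values.sum()   # best document token per query token, summed

torch.manual_seed(0)
q = F.normalize(torch.randn(5, 128), dim=-1)                      # 5 query tokens
candidates = [F.normalize(torch.randn(k, 128), dim=-1) for k in (60, 90)]
print([maxsim_score(q, d).item() for d in candidates])
```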

For ranking a preliminary search result of, say 1000 documents, the maximum similarities (e.g. cosine similarity) between all query embeddings and all embeddings in the retrieved documents are computed. This approach is very efficient as it requires orders of magnitude fewer FLOPS than previous approaches. On the MS-MARCO benchmark [153] a reranking ColBERT achieves a MRR@10-value of 34.9% (first relevant document at position 2.9 on average), which is slightly below the cross-encoder monoBERT.

ColBERT can also be used for end-to-end retrieval. It employs the FAISS index [91] to store the document token embeddings for a k-nearest neighbor search in a preparatory step. Note that for each token in each document an embedding has to be stored, as the embedding depends on the context. The retrieval requires two stages: in the first stage, a number of approximate searches for each query token is performed. In the second refinement stage, these approximate matches are reranked according to the MaxSim criterion. On the MS-MARCO benchmark the end-to-end retrieval by ColBERT has a MRR@10-value of 36.7%, which is much better than the reranking performance and on par with the much more expensive BERT cross-encoder approach.

Model 1 [28] combines a number of techniques for a retrieval model based on token embeddings. First the authors estimate the probability p(q|d) that the query q has been generated as a “translation” of the document d. Using Bayes’ rule the authors get

$$\displaystyle \begin{aligned} p({\boldsymbol{d}}|{\boldsymbol{q}})\propto p({\boldsymbol{q}}|{\boldsymbol{d}})p({\boldsymbol{d}})\propto p({\boldsymbol{q}}|{\boldsymbol{d}}) \end{aligned} $$
(6.4)

assuming a uniform prior p(d) [21]. They consider the probability r(qi|dj) that a query token qi is a translation of a document token dj. Approximating r(qi|dj) by a neural network, they use embeddings of the tokens qi and dj as inputs and are able to estimate p(d|q). The approach requires little computational effort. The authors combined the BERT dense retriever with a Lucene search index. Finally, they expand documents for Model 1 with Doc2query. Doc2query [156] aims at generating queries for which the document is relevant. The approach trains a transformer to generate up to 100 query tokens from a document of up to 400 tokens. The model is trained using datasets consisting of pairs of queries and relevant documents, e.g. MS-MARCO. On MS-MARCO they achieve 39.1% MRR@100. The context-free neural Model 1 is less effective than a BERT-based ranking model, but it can run efficiently on a CPU (without expensive index-time precomputation or query-time operations on large tensors).

Most current retrievers do not process long documents as a whole, although long-document matching has many important applications like news recommendation, related article recommendation, and paper citation suggestion. Usually, long documents are partitioned into passages, with the idea that the relevant content is contained in one of the passages. Note that PLMs with longer inputs, e.g. BigBird, can improve performance (Sect. 3.2), although this has to be evaluated empirically. The SMITH model [247] uses a BERT-based hierarchical encoder to capture the document structure information. The document is first partitioned into sentences and for each sentence token embeddings are computed. Each sentence starts with a [CLS] token, whose embedding represents the sentence. A second, sentence-level BERT receives only the sentence embeddings as input. The embedding of the first artificial token of this second-level BERT is used as the embedding of the whole document.

The model is pre-trained with the masked language modeling task to get token embeddings. In addition, the second level uses a masked sentence block prediction task, where the model has to select the correct embedding from all sentence embeddings in a batch. The fine-tuning task maximizes the relevance score predicted from the document embedding by a logistic classifier on the relevance-annotated fine-tuning dataset. On the Wiki65K benchmark with long Wikipedia articles [87] the approach achieves an accuracy of 95.9%, which is a significant improvement over prior approaches.

6.1.5 Dense Passage Embeddings and Nearest Neighbor Search

Representing text passages by embedding vectors has the potential to solve the problem of vocabulary mismatch by directly matching “meaning” in a representation space. These so-called dense retrieval techniques can perform ranking directly on vector representations generated by PLMs. In contrast to comparing token embeddings pairwise, this approach offers a much more efficient retrieval procedure. It is performed by matching the embedding vector of a query with the embedding vectors of passages employing an index and approximate nearest neighbor search. Efficient, scalable solutions are available today in open-source libraries.
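
A minimal sketch of this index-and-search step with the FAISS library is shown below, using random vectors in place of real passage embeddings. The exact flat inner-product index is an illustrative choice; for web-scale collections it would be replaced by an approximate index such as HNSW or an inverted-file index.

```python
import faiss
import numpy as np

d = 768                                                    # embedding size, e.g. BERT vectors
passage_emb = np.random.rand(10_000, d).astype("float32")  # precomputed passage embeddings
faiss.normalize_L2(passage_emb)                            # normalize: inner product = cosine

index = faiss.IndexFlatIP(d)                               # exact inner-product search
index.add(passage_emb)

query_emb = np.random.rand(1, d).astype("float32")         # embedding of the query
faiss.normalize_L2(query_emb)
scores, ids = index.search(query_emb, 10)                  # ids of the top-10 passages
print(ids[0], scores[0])
```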

Given a query q and a set of documents D = {d1, …, dn} we want to define functions ηq(⋅) and ηd(⋅), which convert the token sequences q and d into fixed-width vectors. The functions should have the property that the similarity between ηq(q) and ηd(di) is maximal if di is relevant for query q. We want to estimate

$$\displaystyle \begin{aligned} p(\text{relevant}=1|d_i,q) := \phi({\boldsymbol{\eta}}_q(q),{\boldsymbol{\eta}}_d(d_i)), \end{aligned} $$
(6.5)

where ϕ(⋅) is a similarity comparison function, e.g. the scalar product [124, p. 133]. Note that ηd(di) may be precomputed and organized in an index. By using different encoders ηq(⋅) and ηd(⋅) for queries and documents, we can take into account the different roles and wordings of queries and documents.

SentenceBERT [183] is the prototype of a bi-encoder design for generating semantically meaningful sentence embeddings to be used in large-scale textual similarity comparisons (Fig. 6.3). The query q and the documents di are processed by the same PLM (BERT or RoBERTa). Similarity is measured by the cosine similarity

$$\displaystyle \begin{aligned} \phi({\boldsymbol{\eta}}_q(q),{\boldsymbol{\eta}}_d(d_i))=\frac{{\boldsymbol{\eta}}_q(q)^\intercal {\boldsymbol{\eta}}_d(d_i)}{\left\lVert {\boldsymbol{\eta}}_q(q)\right\rVert *\left\lVert {\boldsymbol{\eta}}_d(d_i)\right\rVert }. \end{aligned} $$
(6.6)
Fig. 6.3

The SentenceBERT model uses two fine-tuned BERT models to transform queries and passages to embeddings of the [CLS] token. Subsequently, a cosine similarity module is used to compute a similarity value

To generate sentence embeddings the authors investigated three alternatives: (1) use the embedding of the [CLS] token; (2) average (mean-pool) all output embeddings; (3) take the component-wise maximum (max-pool) of all output embeddings. Without fine-tuning the results were worse than for non-contextual embeddings. Fine-tuning boosted performance and yielded a new Sota. It turned out that average pooling was the most effective design, slightly better than max pooling or using the [CLS] token. Most importantly, the computation time for finding the best match in 10,000 documents was reduced from 65 h to 5 s.
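
Such bi-encoders are conveniently available through the sentence-transformers library. The following sketch assumes a small publicly released model rather than the original SentenceBERT checkpoint; the passage embeddings can be precomputed and stored.

```python
from sentence_transformers import SentenceTransformer, util

# A small public bi-encoder using mean pooling over the token embeddings.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

query = "he needs to get rid of a lot of junk"
passages = [
    "he has tons of stuff to throw away",
    "the weather will be sunny tomorrow",
]

q_emb = model.encode(query, convert_to_tensor=True)
p_emb = model.encode(passages, convert_to_tensor=True)

# Cosine similarity as in Eq. (6.6); the paraphrase receives the higher score.
print(util.cos_sim(q_emb, p_emb))
```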

DPR [94] used separate encoders ηq(q) and ηd(di) for the query q and the text passages di of about 100 words. Both encoders took the [CLS] embedding from BERTBASE as their output representation. As comparison function the inner product \({\boldsymbol {\eta }}_q(q)^\intercal {\boldsymbol {\eta }}_d(d_i)\) was used. For each query qi the training set contained one correct passage \(d^+_i\) and a number of negative passages \(d^-_{i,1},\ldots ,d^-_{i,m}\). The loss function encoded the goal of getting a large ϕ-value (i.e. similarity) for qi and \(d^+_i\) and small similarities for qi and \(d^-_{i,j}\)

$$\displaystyle \begin{aligned} L(w) = -\log \frac{\exp[ {\boldsymbol{\eta}}_q(q_i)^\intercal{\boldsymbol{\eta}}_d(d^+_i)]} {\exp[ {\boldsymbol{\eta}}_q(q_i)^\intercal{\boldsymbol{\eta}}_d(d^+_i)] + \sum_{j=1}^m \exp [{\boldsymbol{\eta}}_q(q_i)^\intercal{\boldsymbol{\eta}}_d(d^-_{i,j})]} {} \end{aligned} $$
(6.7)

The negative examples were a mixture of passages retrieved with keyword search that did not contain the answer and thus were difficult negatives. In addition, passages from other examples in the same training batch were used. Instead of performing an exhaustive computation of similarities between ηq(q) and the ηd(di) of all documents, we can employ an approximate nearest neighbor search. FAISS [91] is an open-source library for efficient similarity search that supports, among other index types, hierarchical navigable small world (HNSW) graphs. For the Natural Questions benchmark DPR achieved a top-20 accuracy of 79.4%, which is much better than the previous top-20 accuracy of 59.1% for the keyword-based BM25 search. The replication study [136] could confirm these results, but found that a hybrid approach of DPR and BM25 could increase the performance to 82.6%.
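
The contrastive loss of Eq. (6.7) can be written compactly as a cross-entropy over similarity scores. The sketch below uses random vectors for a single query; in DPR the negatives mix hard BM25 negatives and in-batch passages, and both encoders are updated by backpropagating this loss.

```python
import torch
import torch.nn.functional as F

def dpr_loss(q_emb: torch.Tensor, pos_emb: torch.Tensor, neg_emb: torch.Tensor) -> torch.Tensor:
    """Loss of Eq. (6.7) for one query.

    q_emb:   (d,)   query embedding eta_q(q_i)
    pos_emb: (d,)   embedding of the relevant passage d_i^+
    neg_emb: (m, d) embeddings of the negative passages d_i^-
    """
    pos_score = (q_emb @ pos_emb).unsqueeze(0)   # inner-product similarity, shape (1,)
    neg_scores = neg_emb @ q_emb                 # shape (m,)
    logits = torch.cat([pos_score, neg_scores])
    # Negative log-likelihood of the positive passage among all candidates.
    return -F.log_softmax(logits, dim=0)[0]

d = 768
loss = dpr_loss(torch.randn(d), torch.randn(d), torch.randn(7, d))
print(loss)
```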

ANCE [238] uses a single RoBERTa model to encode query and document. During training, hard negative examples are selected by approximate nearest neighbor search on an index over the representations generated by the trained encoder. In this way, “difficult” negative examples can be selected. The index is periodically updated. On Natural Questions ANCE achieved 82.1% top-20 accuracy. The performance was also compared with the monoBERT cross-encoder, which reranks first-stage BM25 results by comparing all documents to the query. It turned out that on MS-MARCO the application of monoBERT to the BM25 results had an MRR@10 of 34.7%, while ANCE reached 33%. The cross-encoder is obviously more effective than ANCE. The authors also applied ANCE to 8 billion documents using embeddings of size 64 and approximate nearest neighbor search. They reported a gain of 16% compared to the prior commercial implementation.

RocketQA [184] performs a first retrieval step and subsequently a reranking procedure. Both modules are jointly optimized using a listwise training approach, where a list of positive and negative examples is used for training both modules. In addition, the authors perform data augmentation to construct diverse training instances by incorporating both random sampling and denoised sampling. They report an MRR@10 on MS-MARCO of 38.8% for passage retrieval. When the top 50 results are subsequently reranked, they can increase the MRR@10 to 41.9%.

coCondenser [63] is one of the highest-ranked entries on the MS-MARCO leaderboard [140]. The model is forced to learn to aggregate information into the “CLS” embedding, which then participates in the LM prediction. In addition, a “contrastive loss” is used: the “CLS” embeddings of passages from the same document should be close together, while those of passages from different documents should have a larger distance. This yields highly expressive embeddings for passages. When the model is fine-tuned on MS-MARCO, it returns an MRR@100 of 40.8% on the MS-MARCO leaderboard [140].

6.1.5.1 Available Implementations

6.1.6 Summary

Retrieval is a crucial step in web search, in which a small set of query-relevant candidate passages is identified from a corpus of billions of texts. Discovering more semantically related candidates in the retrieval phase holds great promise for presenting more high-quality results to the end user. Dense retrieval approaches represent a paradigm shift in search engine technology. They make it possible to recognize the meaning of words and paraphrases and thus find much better passages matching a query. Search results can also be used for question answering models (Sect. 6.2) and dialog systems (Sect. 6.6). They are already being used in production search engines by Bing [35, 238, 266], Google [152, 197], and Facebook [82].

The dense retrieval methods discussed above are fine-tuned in a supervised setting using human relevance labels as input, e.g. from MS-MARCO. Best results are obtained with two different PLMs encoding the query and the documents. Both PLMs are trained to increase the probability of a correct reference document in contrast to some negative documents. As two different PLMs require more effort, most systems use a single model to encode the question and the documents. Experiments show that the combination of dense retrieval and keyword retrieval seems to have advantages. In Sects. 6.2.2 and 6.2.3 retrieval techniques for question answering are discussed, which are even more powerful.

A problem is the transferability of a search system to a new domain. BERT was found to have strong cross-domain relevance classification capabilities when used in a similar way as monoBERT [124, p. 72]. If a BERT model is fine-tuned using relevance judgments from one domain (e.g., tweets) it can be successfully applied to a different domain (e.g., newswire articles). On the other hand, Thakur et al. [221] created a benchmark called BEIR with 18 retrieval tasks from very different domains like bio-medicine and tweets. The authors trained a large number of dense retrieval techniques on MS-MARCO and evaluated them on the other tasks. They found that the dense retrievers were on average less effective than BM25, which, due to its simplicity, works robustly in most cases.

The memory requirements of an index for embeddings cannot be ignored. While a keyword Lucene index for the MS-MARCO passage corpus with 8.8M passages needs 661 MB, a FAISS index for vectors of size 768 requires 42 GB and an index for ColBERT takes 156 GB [124, p. 159]. To apply these techniques at web scale, approaches with a smaller memory footprint are needed.

6.2 Question Answering

Question Answering (QA) is an application of NLP that receives a natural language query and automatically generates a precise answer in natural language. It is a long-standing AI task dating back to the 1960s [69]. Compared with search engines discussed in Sect. 6.1, the QA system presents the final answer to a question directly instead of returning a list of relevant snippets or hyperlinks. Thus, it is more user-friendly and efficient. Often, the system has access to a database or a knowledge base (KB) of documents, such as Wikipedia, where it can search for relevant information.

A Closed Domain QA system handles questions for a specific domain, e.g. medicine, and has background knowledge about that domain or is trained with a large training set covering that domain. Open Domain QA systems (ODQA) deal with questions on almost any topic and usually rely on general KBs or Internet search [37]. Multimodal QA systems address questions in different media, e.g., text and images. A survey of ODQA is given by Zhu et al. [265]. Table 6.3 compiles leading QA Models with their performance.

Table 6.3 Question answering models with their performance. The lower part contains retrieval models. Benchmarks: NQ: natural Questions benchmark of Google queries [109], TriviaQA: TriviaQA benchmark [92, 226], HotpotQA: multihop benchmark [249], EM: exact match

A simple form of question answering is Reading Comprehension, where the system has to identify an answer to a question in a given text. Often a BERT-like system marks the answer span in the text by span prediction (Sect. 2.1.3). This task can largely be considered solved. For the SQuAD 2.0 benchmark [179] ALBERT yields an F1-value of more than 93%, and the fine-tuned ST-MoE-32B mixture-of-experts model (Sect. 3.5.2) with 269B parameters [270] achieves an F1-value of 96.3%, while the human F1-value is 89.5% [178]. However, Sen et al. [199] indicate that systems trained on one dataset may not generalize well to other benchmarks.

6.2.1 Question Answering Based on Training Data Knowledge

Language models are often trained on comprehensive text collections and are able to memorize a large amount of information. A frequently used benchmark is Natural Questions (NQ) [109], which has been sampled from the Google search logs (Sect. 6.1.2). For a given question, the system has to find a short answer span in the given support documents. An example is the question “When are hops added to the brewing process?”, which should yield the answer “The boiling process”.

The TriviaQA benchmark [92, 226] contains a set of trivia questions with answers that were originally scraped from the Web. Different from Natural Questions, the questions here are written with known answers in mind. TruthfulQA [125] is a special QA benchmark with short questions that are constructed adversarially, so that some people’s answers might be wrong due to false beliefs and misconceptions. The answers are evaluated according to informativeness and truthfulness.

6.2.1.1 Fine-Tuned Question Answering Models

The BigBird self-attention (Sect. 3.2) was used in an autoencoder trained with the MLM objective on input sequences of 4096 tokens [253]. During fine-tuning on Natural Questions the model had to find a short answer span in one of the given evidence documents. The model achieved an F1-value of 57.9% on this task. The PoolingFormer [256] is an alternative model for long input sequences with a two-level attention schema. Its first level uses a smaller sliding window pattern to aggregate information from neighbors. Its second level employs a larger window to increase the receptive field with pooling attention. An ensemble of fine-tuned PoolingFormers achieves an F1-value of 61.6% on the Natural Questions benchmark. The model is similar to the SMITH model [247], which uses a BERT-based hierarchical encoder to capture the document structure information (Sect. 6.1.4).

An alternative is Macaw [218], a freely available QA system with 11B parameters. It is built on T5 and has strong zero-shot QA capabilities. On a set of 300 challenge questions the authors claim that Macaw outperforms GPT-3 by 10%, although it has only a small fraction of its parameters. In addition to providing an answer for a question, Macaw can also take an answer and produce a question, or generate multiple-choice options for an answer and a question. The authors also provide a detailed analysis of errors.

It is much more difficult to combine different pieces of evidence to find an answer. A benchmark to test this ability is WikiHop [232], where information from different documents has to be merged. An example is the question “Hanging gardens of Mumbai, country?” and the documents “The Hanging Gardens, in Mumbai, also known as Pherozeshah Mehta Gardens, are terraced gardens …” and “Mumbai is the capital city of the Indian state of Maharashtra. It is the most populous city in India …”. For each query up to 140 background paragraphs are provided to the model. On this benchmark BigBird-ETC (Sect. 3.2.1) achieved an accuracy of 82.3%. Currently, the best model for this task is the RealFormer with an accuracy of 84.4% [171], which is slightly below the human performance of 85%. The RealFormer is an autoencoder with a modified architecture, which passes the raw attention scores of all attention heads of the previous layer on to the subsequent layers as a bypass [76].

6.2.1.2 Question Answering with Few-Shot Language Models

Recent Foundation Models are trained with an enormous collection of documents and can generate answers to questions without additional knowledge input. An example is the autoregressive language model GPT-3 with 175B parameters, which was pre-trained on a text collection of books, Wikipedia and web pages of about 500 billion tokens (Sect. 3.1.2). Because of its high model capacity it can absorb a lot of ‘knowledge’ in its parameters. When a Foundation Model is not allowed to use external information, this is called Closed-book QA.

As discussed in Sect. 3.6.3, GPT-3 can be instructed by a few examples (few-shot) to solve a task. Figure 6.4 provides a few-shot prompt example. For Natural Questions, GPT-3 achieves an exact match accuracy of 14.6% in the zero-shot setting, 23.0% in the one-shot setting, and 29.9% in the few-shot setting [29, p. 14]. This was achieved without fine-tuning on Natural Questions. The larger Gopher model with 280B parameters (Sect. 3.1.2) performs slightly worse with 28.2% in the few-shot setting [175, p. 80].

Fig. 6.4

A possible few-shot prompt for GPT-3 to get an answer based on existing knowledge acquired during pre-training [160]
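
Such a few-shot prompt is simply a concatenation of an instruction, a few worked examples, and the new question. The sketch below assembles a prompt in the spirit of Fig. 6.4; the instruction text and the example pairs are our own illustration, not the exact prompt of [160]. The resulting string is passed to the language model, which continues it with an answer.

```python
def few_shot_qa_prompt(question: str) -> str:
    """Build a few-shot closed-book QA prompt in the style of Fig. 6.4."""
    instruction = (
        "I am a highly intelligent question answering bot. If you ask me a question "
        "that is rooted in truth, I will give you the answer. If you ask me a question "
        "that is nonsense or has no clear answer, I will respond with \"Unknown\".\n\n"
    )
    examples = [
        ("Where were the 1992 Olympics held?",
         "The 1992 Olympics were held in Barcelona, Spain."),
        ("What is the capital of France?", "Paris."),
        ("How many squigs are in a bonk?", "Unknown"),   # nonsense question -> Unknown
    ]
    shots = "".join(f"Q: {q}\nA: {a}\n\n" for q, a in examples)
    return instruction + shots + f"Q: {question}\nA:"

print(few_shot_qa_prompt("When are hops added to the brewing process?"))
```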

The even larger PaLM model with 540B parameters (Sect. 3.1.2) was trained on a high-quality dataset with 780B tokens. It uses a new prompting technique to pose reasoning questions, where examples are presented to the system together with chains of thought that partition a reasoning task into smaller problems (Sect. 3.6.4). In this way it gets a recipe for combining facts from different sources to arrive at the final answer.

PaLM was evaluated on a large number of other benchmarks, which in part are QA-tasks. On Natural Questions it arrived at an accuracy of 21.2% with 0-shots and at 36.0% with few-shot prompts [43, p. 47]. On Trivia QA (questions concerning the Wikipedia), BoolQ (question answering with yes/no answers), and PIQA (question answering with reasoning) PaLM also achieved a new Sota. The results are shown in Table 3.4. PaLM was benchmarked with a large number of tests, among them the more than 150 BIG-bench tasks (Sect. 4.1.4). Many of them are QA-related tasks: 21 contextual QA tasks, 24 context-free QA tasks, 36 reading comprehension tasks, and a large number of tasks on specific knowledge and common sense [1, 22]. Additional outcomes for QA-benchmarks of PaLM are given in [43, p. 12], where PaLM always achieves Sota.

6.2.2 Question Answering Based on Retrieval

Retrieval ODQA systems usually work in two stages: for a question a retriever module finds a number of documents from a text collection, which might contain the answer. Subsequently, a reader considers the question and the retrieved documents and generates a natural language answer (Fig. 6.5). Since the model relies on external information, it is referred to as Open-book QA.

Fig. 6.5

Question answering often combines dense retrieval with an answer selection module. The retriever performs a dense retrieval by comparing the embedding of the query with the embeddings of passages. The reader ranks the retrieved documents and generates an answer by an autoregressive Pre-trained Language Model [36]. Credits for image parts in Table A.2

Retrievers have been introduced in Sect. 3.4.5 and were discussed in the context of document retrieval in Sect. 6.1. The retriever may employ a traditional search engine using tf-idf weighting or BM25. Alternatively the retriever may be a dense retriever based on document and question embeddings. It is trained to retrieve passages by computing embedding similarities e.g. by DPR [94] (Sect. 3.4.5). A tutorial on ODQA is provided by Chen [36].

The reader is usually an autoregressive language model that receives both the query and the retrieved documents as inputs. It is fine-tuned to generate a response to the query based on the retrieved information and its internal knowledge.

Question answering with external knowledge bases has the advantage that curated KBs are usually checked for correctness. However, they may have limited coverage of entities, and relations may not be up-to-date. There are a number of approaches to combine PLMs with KBs using techniques like entity mapping (Sect. 3.4.1). Recent papers propose a hybrid approach using KBs and retrieval [239]. Knowledge-Guided Text Retrieval [145] starts by retrieving text passages for a query. It creates a passage graph, where vertices are passages of text and edges represent relationships that are derived either from an external knowledge base or from co-occurrence in the same article. On Natural Questions [109] it achieves an accuracy of 34.5%.

HYBRIDER [41] uses information from a retriever as well as from a structured KB and tables. The authors collected Wikipedia pages and constructed a benchmark dataset HybridQA containing question-answer pairs requiring multi-hop reasoning using text, tables and hyperlinks (Fig. 6.6). The model first links questions to table cells as well as Wikipedia passages and hyperlinks. In a reasoning phase the linked information is ranked and consolidated to derive the probabilities of different answers. The experiments with the dataset show that the utilization of tables or retrieval alone achieves an exact match accuracy of about 20%, while the joint model yields more than 40%. However, the hybrid model’s score is still far below human performance.

Fig. 6.6

For hybrid question answering Wikipedia pages are retrieved by HYBRIDER [41] (top left). Some pages contain tables (left). Here the column titles may be interpreted as well as hyperlinks to entities (underlined). The lower part shows two human-annotated question-answer pairs. Image reprinted with kind permission of the authors [41, p. 2]

One of the first retrieval-reader systems was DPR (Dense Passage Retriever) [94]. It employs a BERT model to encode passages by embeddings and retrieves them by approximate k-nearest neighbor search with the FAISS index (Sect. 6.1.5). In this way it can gather passages with similar meaning but different wording. The DPR reader is another BERT model which is fine-tuned to predict a probability for each retrieved passage that this passage contains the correct answer. In addition, it selects a span of tokens by span prediction, which probably provides the answer. The approach can be easily applied to KBs with billions of passages [94, 213]. On the Natural Questions [109] it yields a test set accuracy of 41.5%.

FiD [84] is described in Sect. 3.4.5. The retriever is based on DPR and compares query and passage embeddings. Raffel et al. [177] have shown that generative models like T5 can produce the answer for QA tasks. FiD processes the query and the retrieved passages with a reader based on a T5 model to generate an answer. Since the encoder processes the retrieved passages one by one and only the decoder fuses the evidence, the system is very efficient. FiD achieves an exact match accuracy of 51.4% on the Natural Questions test set compared to 41.5% for DPR.

REALM [75] and RAG [114] are retrieval-augmented generative models for open domain question answering. However, they process all retrieved passages simultaneously in an autoregressive language model and were unable to take into account a large number of passages, leading to lower exact match accuracies on Natural Questions of 40.4% for REALM and 44.5% for RAG. Sachan et al. [194] propose an end-to-end differentiable training method for retrieval-augmented ODQA. Latent variables indicate which of the relevant documents should be included. The values of the latent variables are iteratively estimated by an EM-algorithm. On Natural Questions they achieve an exact match accuracy of 52.5%.

MTR [138] starts from the observation that neural retrievers perform well on their fine-tuning domain, but will typically achieve low out-of-domain performance. The authors propose a multitask retriever similar to DPR which is jointly fine-tuned on eight diverse retrieval tasks. They use a shared passage encoder (so that a single index of encoded passages can be used) as well as a query encoder that is shared across all tasks. In five of the eight tasks they achieve a higher performance than specialized models tuned to the corresponding domain.

AISO [268] is a retriever-reader architecture for solving multi-hop QA tasks, where multiple documents are required to answer a question. Repeated retrieval rounds are performed in which associated terms are taken as new search queries to find additional evidence. The approach is adaptive and at each step selects one of three types of retrieval operations (e.g., BM25, DPR, and hyperlink) or one answer operation. On the HotpotQA benchmark [249], the question-answering system must find the answer to a query in the scope of the entire Wikipedia. The AISO model achieved a new Sota with a joint F1-value of 72.0%.

The FB Hybrid system was presented at the EfficientQA competition [147], where real user questions for the Google search engine from the Natural Questions dataset [109] were tackled. While the original NQ was a reading comprehension task providing a number of evidence documents for each question, the EfficientQA benchmark [147] adapted this to open-domain QA by taking examples with answers of up to five tokens and discarding the evidence documents. The system uses a retriever-reader architecture [158]. The retriever is a mixture of DPR and another retrieval system, which covers lists and tables as well as KB-relations and retrieves 100 passages. The reader is a T5-large Seq2seq model, which is given the 100 passages from the retriever and generates an answer. The background corpus contained 18.8M passages from Wikipedia. On Natural Questions the model achieves an exact match accuracy of 53.9%. According to an evaluation by human raters the model was able to answer 67.4% of the questions correctly, which is about as good as the performance of human experts using a search engine. The MS UnitedQA model had a similar architecture [139]. It uses a BERT-based retriever, and a reader combining a T5 model and ELECTRA processes the returned documents to generate different answers. A final re-ranking model selects the answer. MS UnitedQA yields an exact match accuracy of 54.0% and 65.8% correctness on Natural Questions. If the systems were restricted to a memory footprint of 6 GB, the performance was only marginally reduced.

6.2.3 Long-Form Question Answering Using Retrieval

6.2.3.1 A Language Model with Integrated Retrieval

Retro [25] is an autoregressive language model with 7B parameters using retrieved information to predict the next token. As retriever a frozen BERT model is employed (Fig. 6.7). Each training sequence is split into chunks, which are augmented with their k nearest neighbors retrieved from a database of 2 trillion tokens. The returned information is processed in the language model to improve the prediction of the next token, leading to large performance gains. The reader consists of a differentiable encoder for the retrieved neighbors and a chunked cross-attention module to predict tokens.

Fig. 6.7

The Retro language model retrieves information for the input sequence. The model uses the input sequence and the retrieved neighbor chunks from the database as input and generates an appropriate output [176]

An input sequence v = (v1, …, vn) of n=2048 tokens is split into chunks ct = (v(t−1)∗m+1, …, vtm) of length m=64. Each chunk ct is expanded with a set Ret(ct) of retrieved k nearest neighbor chunks from the database. The probability of a token vtm+i in the next chunk ct+1 then can be recursively computed as

$$\displaystyle \begin{aligned} p(v_{t*m+i}|v_{t*m+(i-1)},\ldots,v_{t*m+1},\boldsymbol{c}_t,\text{RET}(\boldsymbol{c}_t),\ldots, \boldsymbol{c}_1,\text{RET}(\boldsymbol{c}_1) ) {}. \end{aligned} $$
(6.8)

The probability of the i-th token of the (t + 1)-th chunk ct+1 depends only on the previous tokens and on the data Ret(cj) retrieved from the database for the previous chunks. This integrates the retrieval process in the language model.

The retriever for a chunk ct uses the average Bert(ct) of all BERT embeddings of the tokens in ct as key. It retrieves the k nearest neighbors from the database with respect to the L2 distance \(||\text{BERT}(\boldsymbol {c}_t)-\text{BERT}(\tilde {\boldsymbol {c}_s})||{ }_2^2\). The model receives the corresponding chunks \(\tilde {\boldsymbol {c}}_{s,j}\) and additionally their continuation chunks \(\tilde {\boldsymbol {c}}_{s+1,j}\) for j = 1, …, k, which collectively form the elements of Ret(ct). Filtering ensures that the chunk to be predicted is not included in Ret(ct), as this would invalidate the conditional probability definition. The retrieval is performed in \(O(\log T)\) time using the SCaNN library [73], which collects a set of chunks from a database of 2 trillion tokens in 10 ms. Note that the document corpus of Retro is about 1000 times larger than the databases of FiD and other retrieval models.

Inside the reader the retrieved tokens in Ret(ct) are fed into an autoencoder, which computes a set E of encoded neighbors. Then, so-called Retro blocks

$$\displaystyle \begin{aligned} \text{RETRO}(H,E) := \text{FCL}(\text{CATL}(\text{ATTL}(H),E)) {}, \end{aligned} $$
(6.9)

and standard self-attention blocks Lm(H) := Fcl(Attl(H)) are interleaved and operate on the intermediate embeddings \(H\in \mathbb {R}^{n\times d}\). Here Fcl(⋅) is a fully connected layer, Attl(⋅) a self-attention layer, and Catl(⋅, E) a cross-attention layer which includes the information in E. The input and output dimension of these modules is \(\mathbb {R}^{n\times d}\).
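
A simplified PyTorch sketch of such a block is given below, under the assumption that the retrieved neighbors are already encoded into one flat sequence E. The chunk-wise structure of the real cross-attention, as well as layer normalization, is omitted; residual connections are added as usual in transformer blocks.

```python
import torch
import torch.nn as nn

class RetroBlock(nn.Module):
    """Simplified sketch of Eq. (6.9): FCL(CATL(ATTL(H), E))."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, H: torch.Tensor, E: torch.Tensor) -> torch.Tensor:
        # ATTL: causal self-attention over the intermediate embeddings H (n x d).
        n = H.size(1)
        causal = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
        H = H + self.self_attn(H, H, H, attn_mask=causal)[0]
        # CATL: cross-attention mixing in the encoded retrieved neighbors E.
        H = H + self.cross_attn(H, E, E)[0]
        # FCL: position-wise fully connected layer.
        return H + self.ffn(H)

H = torch.randn(1, 128, 512)   # embeddings of 128 input tokens
E = torch.randn(1, 256, 512)   # encoded neighbor chunks from the retriever
print(RetroBlock()(H, E).shape)
```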

The resulting language model is able to predict the next token with high reliability. The Pile [62] is an 825 GB open-source text collection that consists of 22 diverse, high-quality datasets. It was screened for toxic language and bias, e.g. with respect to gender, religion, and race. Its authors recommend measuring the quality of token prediction in bits per byte (bpb), which in contrast to perplexity is independent of the tokenizer [62, p. 6]. The authors compare Retro with GPT-3 (175B) [29], Jurassic-1 (178B) [121], and Gopher (280B) [176]. It turns out that Jurassic-1 has the lowest (and best) bpb-value on 5 Pile datasets, Gopher on 2 datasets, and Retro on 9 datasets, although it is far smaller than the other models [25]. GPT-3 was inferior to all three models. A possible problem for these results is the overlap of the retrieval corpus with the test data.

For the LAMBADA benchmark [165] a model has to predict the last word of a paragraph. The authors measure the following accuracies: Retro without retrieval 70%, Retro with retrieval 73%, Gopher 74.5%, and GPT-3 76.2%. Note that Retro has only 4% of the parameters of GPT-3. For question answering the Natural Question benchmark is relevant. Here, Retro achieved an exact match accuracy of 45.5%.

The LaMDA [222] dialog system (Sect. 6.6.3) is an expanded version of Retro with 137B parameters. It demonstrates that facticity can be improved by retrieval models. In addition, it is able to reduce toxic language by a system of filters that block unwanted speech. Although this model could also easily be used for question answering, no corresponding benchmark results are known.

6.2.3.2 Controlling a Search Engine by a Pre-trained Language Model

WebGPT [149] extends GPT-3 to control the Bing search engine and perform a web search for a specific query. The language model must issue commands such as “Search …”, “Find in page: …” or “Quote: …”, as shown in Fig. 6.8. In this way, the model collects passages from web pages which contain information relevant to the question. The utilization of Bing has the advantage that it has powerful search capabilities and covers a large number of up-to-date documents.

Fig. 6.8

Possible actions of the WebGPT language model: search, clicked on link, find in page, quote, scroll down, scroll up, top, back, and end. If another text is generated, this is an invalid action and is ignored [149]

Browsing continues until the model issues a command to end browsing, the maximum total length of references has been reached, or the maximum number of actions has been reached. If a relevant reference has been retrieved, the model will generate a long-form answer to the question.

The GPT-3 model is first fine-tuned to mimic human demonstrations, enabling it to use the text-based browser to answer questions. Then, the usefulness and accuracy of the model’s answers are improved by fine-tuning a reward model to predict human preferences and optimizing against it by rejection sampling. Specifically, the model is fine-tuned to answer questions from ELI5 [56], a dataset of open-ended questions obtained from the subreddit ‘Explain Like I’m Five’. An example is given in Fig. 6.9. The proposed WebGPT answers should be coherent, relevant, and supported by trustworthy documents. No details are reported on the input prompts of GPT-3 containing the current state of the search, or on how the GPT-3 model combines the returned documents into an answer. Note, however, that there is significant overlap between training and validation in ELI5, as at least 81% of ELI5 validation questions occur in the training set [106] in paraphrased form.

Fig. 6.9

Long-form answer to a question generated by WebGPT. The best of 64 answers was automatically selected. The citations were automatically retrieved from the Bing search engine and added to the answer [80]

The final answers were selected from 64 trials of the 175B WebGPT model by ranking. These answers were preferred by human raters to the reference responses from the ELI5 dataset 69% of the time. Moreover, they were preferred to the human demonstrator responses in 56% of the cases.

For WebGPT, responses to TruthfulQA [125] were correct about 75% of the time, whereas GPT-3 scored 64% with helpful prompts. While GPT-3’s answers were truthful and informative only about 20% of the time, the best version of WebGPT increased this to about 56%. Since people answered 94% of the questions correctly, the models still show a significant performance gap. On TriviaQA WebGPT achieved a score of 69.5%, which is far less than the value of PaLM with 81.4%.

An innovative feature is the support of text passages by references. This corresponds to the approach of scientific papers to underpin claims by references and was already suggested by Metzler et al. [143]. The references explain the answer and support the factual accuracy of the statements. The citations are selected by Bing in response to the query. They should therefore be close to the final reader-generated response and provide an easy way to assess the correctness of the response.

However, the authors point out that the references are not always representative for the available evidence, although the model cites references that correspond to the generated text. In addition, it is difficult for the model to verify the trustworthiness of references. Here, Web-of-Trust systems and search engine technology could be employed, which favor trust-checked frequently linked web pages.

6.2.3.3 Available Implementations

6.2.4 Summary

A number of Foundation Models have been presented which were able to improve question answering performance. Examples are the autoregressive language models GPT-3 (175B), Gopher (280B), and PaLM (540B) with huge parameter sets, which are trained on large document collections and can acquire extensive knowledge. Using few-shot prompts they are able to answer questions with high accuracy without employing external knowledge.

Recently, the retriever-reader architecture has been increasingly used for QA systems. It has the potential to tap into a larger knowledge base or the Internet that can easily be kept up-to-date. The retriever can employ keyword search or dense retrieval. Dense retrieval mitigates the term-mismatch problem, where relevant paraphrases are ignored. Usually, embeddings for each document or phrase are pre-computed and the embedding index is constructed beforehand. Current systems can access document collections of up to trillions of tokens using advanced nearest-neighbor search engines like FAISS and SCaNN to compare embeddings.

The reader usually receives the query and the returned passages in text form and generates the answer. It is fine-tuned to select the correct answer and to provide answers which are expressive and truthful. The Retro model is an autoregressive language model with only 7B parameters, which uses passages retrieved by a frozen BERT model as additional current state information to generate the next tokens. It is capable of improving accuracy to high levels for many QA tasks, but can also be used for other applications such as story generation.

WebGPT combines GPT-3 and the Bing search engine to retrieve documents and create appropriate answers. It is able to enhance the generated text with references to documents, which justify and explain the answer. The LaMDA dialog model is an expanded version of Retro with 137B parameters, specifically tuned to increase usability and factual accuracy. In addition, it is able to reduce toxic language by a system of filters that block unwanted speech. These techniques can also be applied to question answering.

Still difficult is the generation of answers where the correct response needs information from multiple documents. In this case several rounds of querying are necessary. Special models like RealFormer, HYBRIDER, or AISO can improve the performance for benchmarks like WikiHop.

6.3 Neural Machine Translation

Language is the cornerstone of most human communication and interaction. Moreover, many persons think in terms of language, and use it to express and communicate feelings, goals, and ideas. We communicate knowledge by language and use it to establish social and emotional relationships. There are more than 7100 languages in the world [19], some of which are shown in Fig. 6.10. The ability to understand each other across language barriers is essential for communicating ideas between people.

Fig. 6.10

This map shows some of the world’s 7100 languages, with each dot representing a language and the color indicating the top language family for each language. Only a small fraction of the world’s languages are currently represented in Foundation Models. Image reprinted with kind permission of the authors [24, p. 23]

After initial successes with Recurrent Neural Networks [15, 215], the development of the Transformer encoder-decoder (Sect. 2.3) has driven progress in Neural Machine Translation (NMT). Cross-attention establishes a “correlation” between each token of the source text and the translated text, producing better translations than before. The availability of large training sets and better model architectures has steadily increased the performance of Pre-trained Language Models for NMT (Fig. 6.11). Standard models for multilingual processing are described in Sect. 3.3. A survey is provided by Yang et al. [248].

Fig. 6.11
An area graph represents the BLEU score of translations of different languages into English. The graph highlights March 2008, May 2019, and May 2020.

Bleu scores for Google translation of 100+ different languages to English for different years. Image credits in Table A.2

6.3.1 Translation for a Single Language Pair

The training data of NMT consist of text pairs of the source language and its translations to the target language. Traditionally evaluation is done by comparing one or more reference translations to the proposed translation, as described in the survey [195]. There are a number of automatic metrics like Bleu, Meteor or BERT-score (Sect. 2.3.3). It turned out that there is a noticeable difference between human judgment and automatic evaluation. Therefore, most high-end comparisons today use human translators to assess the quality of translation methods.

At the WMT2021 Machine Translation conference, numerous teams solved benchmark tasks for translating English news texts from/to German, Japanese, Russian, Chinese, and a number of low-resource languages [5]. Instead of using comparison statistics like Bleu, the translations of each system were evaluated by a number of human evaluators without showing them a reference translation. They were asked to rate a given translation according to how adequately it expressed the meaning of the corresponding source language input on an analog scale, which corresponds to an underlying absolute rating scale of 0–100. As some raters may be stricter than others, the systems are ranked by a z-score, where each score is mean-centered and normalized per rater. Systems are grouped together according to which system significantly outperforms all others, as measured by the Wilcoxon rank-sum test. Considerable effort was spent on assessing the validity of the human evaluation.
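
A minimal sketch of this per-rater normalization is shown below; the rater and system names as well as the raw scores are purely illustrative.

    from collections import defaultdict
    from statistics import mean, stdev

    # (rater, system, raw score on the 0-100 adequacy scale); values are made up
    ratings = [("r1", "A", 80), ("r1", "B", 60),
               ("r2", "A", 55), ("r2", "B", 45), ("r2", "A", 65)]

    # Collect each rater's scores to estimate their personal mean and spread.
    by_rater = defaultdict(list)
    for rater, _, score in ratings:
        by_rater[rater].append(score)

    # Mean-center and scale every score per rater, then average per system.
    z_scores = defaultdict(list)
    for rater, system, score in ratings:
        mu, sd = mean(by_rater[rater]), stdev(by_rater[rater])
        z_scores[system].append((score - mu) / sd)

    for system, zs in sorted(z_scores.items()):
        print(system, round(mean(zs), 3))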

In total, 173 submissions were received. In addition, five anonymized online systems were included. Human-produced reference translations were denoted by “HUMAN” in all tables. The results show that almost all strong systems are based on transformer encoder-decoders. Words are mostly encoded by the SentencePiece tokenizer [107] (Sect. 1.2). A widely used technique is back-translation [200]: monolingual text in the target language is translated into the source language, and the resulting synthetic sentence pairs are added to the training data; iterating this procedure in both directions can improve both translation models. Up to 500M sentences per language were available and could be used for back-translation, which led to a significant improvement in quality. In addition, ensembles are able to increase the performance in most cases.
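
The back-translation step can be sketched as follows; the small publicly available MarianMT checkpoint from the Hugging Face transformers library is only an illustrative stand-in for the much larger WMT systems, and the monolingual sentence is made up.

    from transformers import MarianMTModel, MarianTokenizer

    # A pre-trained de->en direction; the model name is an example, not a WMT submission.
    name = "Helsinki-NLP/opus-mt-de-en"
    tok = MarianTokenizer.from_pretrained(name)
    model = MarianMTModel.from_pretrained(name)

    # Back-translate monolingual German sentences into English to obtain synthetic pairs.
    monolingual_de = ["Das Wetter ist heute schön."]
    batch = tok(monolingual_de, return_tensors="pt", padding=True)
    synthetic_en = tok.batch_decode(model.generate(**batch), skip_special_tokens=True)

    # The synthetic pairs (synthetic_en[i], monolingual_de[i]) would be added to the
    # training data of the en->de system, whose target side is genuine German text.
    print(list(zip(synthetic_en, monolingual_de)))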

The result of the best system for each language pair is shown in Table 6.4. Usually, there is a cluster of 2–5 models at the top whose performance differences are not significant. The Facebook-AI model (FB) had the best results for half of the language pairs. In addition, the Bleu scores for the best systems, automatically computed from n-grams, are shown. As can be seen, the values are in general better for translation “to English” than “from English”, especially for morphologically rich languages like Czech and German. Compared to the human reference translation, the best system was significantly better for three language pairs. This has been discussed critically by Toral [223], who criticizes the limited amount of inter-sentence context and the limited translation proficiency of the evaluators.

Table 6.4 Leading systems of the WMT2021 News Translation Task. The systems are ordered by normalized z-score [5, pp. 15–19]. If either the best system or a human reference translation is significantly better, the value is printed in bold. Systems: FB: Facebook-AI, BL: Borderline, HW: HW-TSC, NV: Nvidia-NeMo, NI: NiuTrans, OB: Online-B, OW: Online-W, HN: HappyNewYear

Improved performance was reached by increasing the number of parameters. The Facebook model [224], for instance, used a standard model with 4.7B parameters and a sparsely gated mixture-of-experts system with up to 128 experts. In each sparsely gated MoE layer, each token is routed to the top-2 expert feedforward blocks based on the score of a learned gating function. In addition, the models were fine-tuned with domain-specific data from the news domain. The n-best hypotheses were generated with beam search. These were reranked with a weighted average of the probabilities p(tgt|src), p(src|tgt), and p(tgt), where src is the source and tgt the target sentence.
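
The reranking step amounts to a weighted combination of model scores, as in the following sketch; the candidate translations, log-probabilities, and weights are made up for illustration and would normally come from the forward model, the backward model, and a language model.

    # Hypothetical n-best list for one source sentence with log-probabilities from
    # three models: forward p(tgt|src), backward p(src|tgt), and a language model p(tgt).
    candidates = [
        {"text": "Guten Morgen allerseits.", "logp_fwd": -2.1, "logp_bwd": -2.4, "logp_lm": -5.0},
        {"text": "Guten Morgen zusammen.",   "logp_fwd": -2.3, "logp_bwd": -2.0, "logp_lm": -4.2},
    ]

    # Weighted combination of the three scores; the weights would be tuned on a dev set.
    w_fwd, w_bwd, w_lm = 1.0, 0.3, 0.2
    for c in candidates:
        c["score"] = w_fwd * c["logp_fwd"] + w_bwd * c["logp_bwd"] + w_lm * c["logp_lm"]

    best = max(candidates, key=lambda c: c["score"])
    print(best["text"])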

It is well-known that the translation of single sentences suffers from ambiguities (e.g. pronouns or homonyms), which can be resolved by considering the document context. In WMT2021 this is taken into account by assessing the quality of translation within the document context [5]. As current encoder-decoder Foundation Models are able to consider larger contexts, this could improve translation performance [141]. Instead of finding the most probable translation of a sentence, we need to generate the best translation for a given complete source document. While comparing sentence-level translation often does not indicate a difference between human and machine translation, the comparison of document-level translation often yields a statistically significant preference for human translations [110].

Instead of using a Seq2seq model with an extra long input sequence, HAT [187] proposes a hierarchical attention transformer. The authors split the input text into sentences and start each sentence i with a specific [BOSi] token. These tokens summarize the sentence content and are connected to the other sentences by the usual self-attention and cross-attention. While the usual encoder-decoder transformer reaches a Bleu of 32.5 for document translation from English to German on WMT2019, HAT is able to yield a Sota Bleu of 34.5.

6.3.2 Multilingual Translation

Usually, languages with scarce training data have a much lower translation accuracy, as is the case for Hausa in Table 6.4. A recent success was the extension of NMT by multilinguality, which was already discussed in Sect. 3.3. This led to a marked improvement of translations for languages with few resources. For a survey see [48].

M2M of Facebook AI [57] improves translation between many languages by utilizing a massive corpus of 7.5B sentences covering 100 languages and thousands of translation directions with supervised data, created through large-scale mining. The model is a transformer encoder-decoder with 15B parameters. The authors add a special token in the encoder indicating the source language and a special token in the decoder indicating the target language. The transformer has 12 encoder and 12 decoder layers and an embedding size of 1024. As there is a joint token vocabulary for all languages, the input and output embeddings are shared. To improve performance, the authors added language-specific layers to the decoder for five languages. Using specific parallelization techniques they were able to train the model with only a few hundred GPUs.
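
The language-token mechanism can be tried out with the publicly released 418M-parameter M2M-100 checkpoint in the Hugging Face transformers library, a much smaller sibling of the 15B model described above; the example sentence is made up.

    from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

    tok = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")
    model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")

    tok.src_lang = "fr"                                   # source-language token for the encoder
    encoded = tok("La vie est belle.", return_tensors="pt")

    # The decoder is forced to start with the target-language token for German.
    generated = model.generate(**encoded, forced_bos_token_id=tok.get_lang_id("de"))
    print(tok.batch_decode(generated, skip_special_tokens=True))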

Except for four language directions (En→Chinese, Chinese→En, En→Finnish, En→Estonian), the model improved translation results on the WMT benchmarks by 1.9 Bleu points on average. Especially marked is the improvement for regional languages, with an average increase of 7.6 Bleu. For resource-rich language pairs, Liu et al. [130] propose to use very deep transformers with up to 60 encoder layers and 12 decoder layers. They develop a simple yet effective initialization technique that stabilizes training and achieve a Sota of 46.4 Bleu on WMT2014 En-Fr.

Although multilingual translation has many advantages, it usually performs worse than specially trained bilingual models for high-resource language pairs. Recently Facebook [225] presented a single multilingual model, which outperformed the best specially trained bilingual models across 10 out of 14 language pairs of the WMT2021 news benchmark. Facebook built two multilingual systems: any-to-English and English-to-any. They employed data mining techniques to identify translations in large web crawl data and leverage available monolingual data with hundreds of millions of sentences from all eight languages to maximize performance of MT systems. They filtered the available monolingual data to reduce the amount of noise, and then back-translated them with an ensemble of the strongest multilingual models available. The number of parameters was increased from 15B to 53B to enhance the model capacity.

The Bleu scores are shown in Table 6.5. In comparison to the best bilingual models of WMT2021, the multilingual model achieves a better Bleu in 9 of 14 cases indicating that the additional training data from other languages supports translation. Only for Chinese→English there was a larger drop of 1.3 Bleu points. The authors also performed a human evaluation for the language pairs English→Russian and English→German. It turned out that there was no perceived difference between the quality of bilingual and multilingual translations.

Table 6.5 Bleu scores of the Facebook multilingual model and the best language pair model submitted to the WMT2021 news task. The numbers reported are Bleu scores on the final WMT2021 test set [225]. The difference between the models is printed in bold, if the multilingual model is better

Table 6.6 shows the effect of the employed improvement strategies for the different languages of the multilingual model. Back-translation has a large effect for languages with little training data like Hausa and Icelandic. The authors note, however, that back-translation produces “translationese”, i.e. artificial, uncommon phrasings in a language. These effects may be mitigated by fine-tuning on the specific domain, e.g. news texts. This yields about 3 Bleu points for translation into English and 0.7 Bleu points for translation out of English. Switching to the multilingual model generates an improvement for all models. While the effect of model ensembles is minor, re-ranking the beam-search candidates with conditional target-source probabilities yields about 0.4 Bleu points. Postprocessing (for example applying standard punctuation rules) can have a large effect, e.g. 5 Bleu points for Chinese.

Table 6.6 Influence of different modeling improvements on the Bleu scores on the development set of WMT2021 for Facebook AI’s WMT2021 submission [225]. The version of the last row was submitted

The PaLM autoregressive language model with 540B parameters [43] has about 22% non-English training texts among its 780B training tokens (Sect. 3.1.2). Similar to other large LMs, PaLM is not trained explicitly on parallel text, although some such data is likely to occur naturally in the training corpus. In Table 6.7 the results of PaLM 540B few-shot translation are compared with the prior few-shot and fine-tuned Sota [43, p. 27]. The best Bleu value per language pair is underlined and the best few-shot Bleu is printed in bold. The table shows that PaLM always achieves the best few-shot Bleu on the traditional WMT translation pairs, often improving by a wide margin. For the low-resource language Kazakh (kk), the fine-tuned translation models have a better Bleu than PaLM. However, for de→en and ro→en, PaLM is able to outperform the supervised models. In addition, the 0-shot PaLM translation of fr→en achieves a Bleu value of 25.2, which is better than the fine-tuned Sota of 24.9. Overall, PaLM performs close to the fine-tuned models without having been trained for this task.

Table 6.7 Comparison of PaLM few-shot translation performance against prior fine-tuned translation performance by specialized models and prior few-shot performance. On the left you find the translation from English and into English for the traditional WMT language pairs. On the right there is the translation to and from English to Kazakh (kk) and a translation between German and French [43, p. 27]

6.3.3 Multilingual Question Answering

In recent years, open domain question answering (ODQA) has developed rapidly (Sect. 6.2). Therefore, it is extremely rewarding to extend these techniques to multilingual question answering. In this way, information encoded in the world’s different languages can be tapped, and the digital divide can be narrowed by bringing answers to people who speak rarer languages. There is a tutorial on multilingual ODQA by Ruder [192, 193].

A simple way to perform multilingual ODQA is to translate the question to English, use an English ODQA system to generate an answer, and translate the answer back to the target language. Because of ambiguities in translation, this procedure may generate errors in some cases [132]. In addition, information specific to the target language and conceptualizations of the target culture may not be available in English [258].

The TyDiQA-GoldP benchmark [44] is a question answering dataset covering 11 typologically different languages with 204k question-answer pairs. The following languages are included: English, Arabic, Bengali, Finnish, Indonesian, Japanese, Kiswahili, Korean, Russian, Telugu, and Thai. As the languages represented in this benchmark have a very diverse structure, a model that performs well on this data can be expected to have a good QA accuracy on other languages. MKQA [133] is an evaluation dataset created by translating 10k Natural Questions [109] to 25 target languages.

As an alternative, one can train cross-lingual retriever and reader models that combine the information from multiple languages to generate an answer in the target language (Fig. 6.12). CORA [13] answers questions across many languages, even for languages without language-specific annotated data or knowledge sources. It includes a dense passage retriever collecting documents in different languages for a question. The multilingual retriever mDPR, based on mBERT (Sect. 3.3.1), is fine-tuned to encode passages and questions separately. By performing a maximum inner product search, the top-k documents are retrieved similar to DPR (Sect. 3.4.5). It could be shown that mBERT improves the search quality in non-English monolingual retrieval [203]. The reader mGEN is a multilingual autoregressive sequence model (e.g. mT5, Sect. 3.3.2) generating the answer in the target language by compiling the information in the retrieved passages. No specific translation models are used. The initial training data is a combination of multilingual QA datasets. Each training instance from these datasets comprises a question, a positive passage, and an answer. However, these datasets suffer from limited language diversity. Therefore, the authors iteratively generate more representative training data for low-resource languages by exploiting links between Wikipedia articles in different languages.

Fig. 6.12
An illustration represents a set of 2 queries, one in French and the other in Norwegian, that go through the retrieved documents to generate the answers. The generated answer for the French question is wrong, while the one for the Norwegian question is right.

Cross-lingual retrieval by mDPR and answer generation with mGEN for the CORA system [13, p. 9]. The answers to the questions are correct, however, on the left side the answer should have been given in French

It turns out that CORA substantially outperforms the previous Sota on multilingual open QA benchmarks across 26 languages, 9 of which are unseen during training. Here CORA can improve the average F1-value from 17.1 to 21.8. Retrieval with mDPR performs well in Indo-European languages with Latin script, even when the language is unseen. There is a major drop for languages with non-Latin script (e.g., Japanese, Russian, Chinese). Here, perhaps, the model is unable to use relevant passages from other languages to answer questions.

mT5 (Sect. 3.3.2) is a multilingual version of the T5 Seq2seq transformer with up to 13B parameters [246]. It was pre-trained on a dataset of web pages covering 101 languages with about 48B tokens and a common vocabulary of 250k tokens. After fine-tuning on the TyDiQA benchmark, it arrives at an exact match score of 79.1%. ByT5 [245] is a variation of the mT5 multilingual encoder-decoder with 12.9B parameters. It operates on UTF-8 bytes with a vocabulary of 256 possible byte values instead of tokens. The model is pre-trained to replace corrupted spans of 20 bytes on average. The largest model uses 36 encoder and 12 decoder layers. When the model is fine-tuned on gold data in all target languages, it achieves an exact match score of 81.4% on the TyDiQA benchmark.

The PaLM Foundation Model [43] has about 22% non-English training texts in its 780B training tokens (Sect. 3.1.2). Therefore, it can be applied to multilingual tasks such as translation and question answering. With few-shot prompts it achieves an exact match score of 60.5% on TyDiQA. When the model is fine-tuned on TyDiQA, the score grows to 80.0%, which is slightly below the performance of ByT5 XXL. The detailed results in Table 6.8 show the performance for the different languages. PaLM has a better score than ByT5 for two languages. The authors remark that ByT5 was trained with 50% more non-English text than PaLM, which may explain the difference.

Table 6.8 Comparison against Sota on TyDiQA question answering benchmark with 11 typologically different languages. The values are for the validation set with respect to the exact match accuracy [43, p. 32]. Best values for each language printed in bold

6.3.3.1 Available Implementations

6.3.4 Summary

In recent years, machine translation has developed dramatically. The use of encoder-decoder PLMs could overcome the limitations of RNN architectures and increase the performance to near-human levels. Besides the utilization of encoder-decoder Transformers, the availability of high-quality training examples collected by web crawlers using Foundation Models and specific assessment procedures is a reason for this progress [33]. A further improvement resulted from sentence back-translation, which particularly increases results for low-resource languages, and from training a single multilingual model for translation between all languages. Training multilingual translation models with up to 600B parameters, using appropriate parallelization strategies, leads to significant performance increases for 100 languages, as measured by Bleu [113]. Recently, multilingual models were even able to outperform high-resource bilingual translation models. This is also demonstrated by the PaLM Foundation Model, which achieved higher performance in few-shot translation than prior fine-tuned models for some language pairs. Therefore, multilingual models are likely to become standard in the future. However, current multilingual models using unsupervised multilingual training may not model the subtleties of languages and language varieties to their full extent. This has to be checked in future applications.

These developments opened up the opportunity for multilingual question answering systems, e.g. CORA, where queries can be posed in a large number of languages. The answers are compiled from information available in multiple languages. In this way, cultural characteristics and concepts that are not available in all languages can be taken into account. There are also close links to cross-lingual semantic parsing, where a natural language utterance is translated into a logical form for execution on some knowledge base to return an answer [202]. Again, the PaLM Foundation Model provided few-shot answers to multilingual questions that are competitive in accuracy with fine-tuned models for the same benchmarks. A fine-tuned version of PaLM is even able to outperform the prior fine-tuned Sota for two languages.

However, machine translation is not yet solved. There is still the problem of domain mismatch between training and test data. In some cases, systems fail to accurately capture the meaning of a sentence. Systems can also generate biased text, e.g. if gender is handled differently in different languages. On the other hand, attention allows the decoder to look directly at faraway text and provides a soft alignment between words for free. Recently, performance could be increased by translating entire documents, as single sentences often are not sufficient to disambiguate all words. To extend current multilingual models to thousands of languages, new techniques are required [19]. One approach is to use monolingual datasets to improve translation, since the amount of available monolingual text is orders of magnitude greater than the amount of translated text. In addition, this requires highly reliable language detectors that also work for low-resource languages.

6.4 Text Summarization

With the rapid increase of textual information in companies and on the Internet, it is increasingly difficult for people to keep track of a topic. Automatic summarization of documents, which compiles the essential statements of a text, can help to grasp the most relevant information in the documents. A summary is a short version produced from a single document or multiple documents conveying the main points of the original texts. The purpose of automatic text summarization is to produce such summaries efficiently and precisely. Recent in-depth surveys are provided by Ma et al. [135], Guan et al. [71], Syed et al. [216], and El-Kassas et al. [95].

Earlier machine learning approaches produced extractive summaries selecting a few sentences from the document. This approach typically selected grammatically correct sentence parts, but the language style of the combined parts and the coverage were usually not sufficient. Modern summarizers pose summarization as a translation problem, which translates the original document to a short version covering the main points. Since 2017 the encoder-decoder transformer (Sect. 2.3) provided an effective technique to generate abstractive summaries containing the main points of the document. Abstractive summarization is a bit more complex because the text is paraphrased, and the summary usually has words different from the original document. On the other hand, it is more flexible and can aggregate several similar texts expressing related facts with different wordings.

Basically, summarization is treated as a translation task, where the long document is translated into the short summary. Alternatively we can use the long document as the start text of an autoregressive Foundation Model, which is fine-tuned to generate a summary. One of the main challenges for Seq2seq models is that the decoder needs to attend to encoder token embeddings in the large document context to predict the next token of the summary. Therefore, Seq2seq models covering a long input context (Sect. 3.2) are natural candidates. Summarization systems can be either single document summarizers or multi-document summarizers. Table 6.9 lists popular summarization models and their performance.

Table 6.9 Summarization models with their performance measured in Rouge-2. Benchmarks are CNN/DM: the CNN/Daily Mail benchmark [78]; XSum [151]: summarize a news article in a single sentence; arXiv [46]: long scientific documents; PubMed [46]: long medical documents; Multi-News [54]: multi-document summarization with an average document length of 1793 tokens and 2.8 documents per cluster

6.4.1 Shorter Documents

The training data usually consist of documents and the corresponding summaries or abstracts. There are a number of current benchmark datasets for summarization like CNN/Daily Mail [78], Gigaword [150], and Reddit TIFU [101], which have input documents with a length below 1000 tokens and a corresponding summary, and which can be used for fine-tuning. The difference between a reference summary and a predicted summary is assessed by measures like Rouge, Bleu, or Meteor (Sect. 2.3.3), with the recall-oriented Rouge used most frequently.

PEGASUS [128] is a large transformer-based Seq2seq model pre-trained on massive text corpora (Sect. 3.1.3). It follows a new pre-training objective in which whole sentences rather than tokens are masked. During pre-training, the model has to generate the masked or removed sentences as the output sequence. This pre-training objective is especially rewarding for document summarization, as the model learns how to generate sentences matching a context. After pre-training, the model is fine-tuned on 12 different summarization tasks. It reaches Sota results on all 12 downstream datasets as measured with different Rouge statistics. In most cases the improvements are considerable [128], e.g. for the CNN/Daily Mail benchmark it has a Rouge-2 score of 21.7. The Rouge-2 scores of other Seq2seq models are similar, e.g. 21.6 for T5, 21.3 for BART, and 21.5 for R3F [4]. Note that for text generation a beam search (Sect. 2.2.3) is often employed, keeping several high-probability versions of the text to increase the consistency of the result.
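
The gap-sentence objective can be illustrated with the following toy sketch; note that PEGASUS actually selects important sentences with a Rouge-based heuristic, whereas this sketch simply picks them at random, and the example document is invented.

    import random

    # Whole sentences (not tokens) are removed from the input and have to be
    # generated as the target sequence.
    document = [
        "The city council met on Tuesday.",
        "It approved the new budget.",
        "Several citizens protested against the decision.",
        "The mayor promised further consultations.",
    ]

    gap_ratio = 0.3
    n_gaps = max(1, int(len(document) * gap_ratio))
    gap_idx = set(random.sample(range(len(document)), n_gaps))

    source = " ".join("<mask_1>" if i in gap_idx else s for i, s in enumerate(document))
    target = " ".join(s for i, s in enumerate(document) if i in gap_idx)
    print("encoder input:", source)
    print("decoder target:", target)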

BRIO [131] starts from the observation that the usual maximum-likelihood training only takes into account a single reference summary per example and ignores other possible summaries. First, a generation model is trained using the standard maximum-likelihood loss for the reference summary. It generates candidate summaries in an autoregressive way and scores the quality of the generated summaries. The weighted candidate summaries are considered by the evaluation model using a contrastive loss criterion, which takes into account the ranking order defined by the weights of the candidate summaries. The approach uses BART or PEGASUS as backbone Seq2seq models. On the CNN/Daily Mail benchmark [78] the BRIO model has Sota performance with a Rouge-2 score of 23.6 on CNN/DM and 25.6 on XSum. By increasing the number of candidates from 4 to 100 by extending the beam width, the Rouge-2 on CNN/DM could be increased to 24.1. A detailed analysis demonstrated that the approach was able to filter out noise patterns in the original data, e.g. the phrase “click here”.
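
A simplified version of such a contrastive ranking loss is sketched below; the candidates are assumed to be sorted from best to worst by their Rouge score against the reference, and the model scores are made up.

    import torch
    import torch.nn.functional as F

    def ranking_loss(cand_logprobs: torch.Tensor, margin: float = 0.01) -> torch.Tensor:
        # cand_logprobs[i]: length-normalized log-probability the model assigns to
        # candidate i; candidates are sorted from best to worst by Rouge.
        # Better candidates should receive higher scores, with a rank-dependent margin.
        loss = cand_logprobs.new_zeros(())
        n = cand_logprobs.size(0)
        for i in range(n):
            for j in range(i + 1, n):
                loss = loss + F.relu(cand_logprobs[j] - cand_logprobs[i] + margin * (j - i))
        return loss

    scores = torch.tensor([-0.9, -1.1, -1.0, -1.6], requires_grad=True)  # made-up scores
    print(ranking_loss(scores))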

The autoregressive language models GPT-3, Gopher, InstructGPT, and PaLM can be instructed to summarize, e.g. by entering a text and appending “TL;DR:” [159]. For PaLM with 540B parameters an evaluation is available. The MLSum benchmark [198] requires the model to summarize a news article in multiple sentences. For German texts, 1-shot PaLM arrives at 12.8 Rouge-2, and a fine-tuned version of PaLM achieves a Rouge-2 score of 33.1, which is below the fine-tuned Sota of 36.4 [43, p. 30]. The XSum benchmark [151] requires summarizing a news article in a single sentence. Here PaLM gets a few-shot Rouge-2 score of 12.2 and a fine-tuned Rouge-2 of 21.2, whereas the fine-tuned Sota Rouge-2 of BRIO is 25.6.

ST-MoE-32B [270] is a mixture-of-experts model (Sect. 3.5.2) with 269B parameters. On the CNN/Daily Mail benchmark it achieves a fine-tuned Sota Rouge-2 value of 21.7, and on the XSum benchmark it yields 27.1 Rouge-2 with fine-tuning. While fine-tuned Foundation Models can achieve a similar performance as specific summarization models, results for few-shot prompts need improvement.

Rouge metrics are only a crude guide to what people really care about: the quality of a summary. Stiennon et al. [211] directly optimize their model with respect to human judgment. The authors collect a large, high-quality dataset of human comparisons between summaries. Then they train a model to predict the human-preferred summary and use this model as a reward function to fine-tune a summarization policy with reinforcement learning. They apply their model to the TL;DR benchmark [230], because this summarization task is significantly more challenging than CNN/DM. They find that the summaries of their 6.7B parameter STIE model are significantly preferred to the reference summaries 70% of the time, whereas the summaries of fine-tuned alternative models are preferred to the reference summaries in about 43% of the cases. The model can also be applied to new domains better than other methods. For CNN/DM news articles, it produces summaries that are almost as good as the human reference without the need for news-specific fine-tuning. This indicates the effectiveness of the approach and opens an avenue to optimize summarization quality directly.
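
The reward model at the core of this approach is trained on pairwise human comparisons; a minimal sketch of the corresponding loss, with made-up reward values, is shown below (the subsequent reinforcement learning step that fine-tunes the summarization policy is omitted).

    import torch
    import torch.nn.functional as F

    def preference_loss(reward_preferred: torch.Tensor, reward_other: torch.Tensor) -> torch.Tensor:
        # The summary chosen by the annotator should receive a higher scalar reward
        # than the rejected alternative.
        return -F.logsigmoid(reward_preferred - reward_other).mean()

    # Made-up rewards for a batch of three human comparisons.
    r_pref = torch.tensor([1.2, 0.3, 0.8], requires_grad=True)
    r_other = torch.tensor([0.4, 0.5, -0.1], requires_grad=True)
    loss = preference_loss(r_pref, r_other)
    loss.backward()
    print(loss.item())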

6.4.2 Longer Documents

While the input length of the documents discussed so far is generally less than 1000 tokens, it is considerably larger for the PubMed corpus (4k tokens) and the ArXiv benchmark (8.6k tokens) [46]. For these benchmarks, transformers with longer input sequences (Sect. 3.2) are capable of taking into account the whole document.

BigBird [253] is able to cope with long documents (Sect. 3.2.1). As the input sequence length of a transformer is increased, the computational effort of self-attention grows quadratically. BigBird has a sparse attention mechanism that reduces this quadratic dependency to a linear one. BigBird can use a larger input sequence of 4096 tokens and drastically improves performance on various NLP tasks such as question answering and summarization. Longer documents exhibit a richer discourse structure, and their summaries are considerably more abstractive. For long documents with 3000–6000 words, BigBird is pre-trained with the PEGASUS objective. After fine-tuning, it yields a marked improvement over the previous Sota, e.g. a Rouge-2 score of 19.0 on the ArXiv benchmark. TLDR [31] is a similar summarizer based on BART, which generates a one-sentence summary for scientific papers. It increases its performance by the auxiliary task of predicting the title of a paper.

HAT [187] aims to capture the content of longer documents in a better way. The authors design a hierarchical Seq2seq attention network that produces sentence-level representations and combines them with token-level embeddings. They determine sentence boundaries by punctuation and insert [BOS] tokens at the start of every sentence. The transformer encoder uses conventional layers that produce an embedding for each token. On top of these, an additional hierarchical layer is added which only attends to the embeddings of the [BOS] tokens. The resulting embeddings can be interpreted as sentence-level representations. The transformer decoder is standard, with an additional layer that attends to the [BOS] embeddings from the hierarchical encoder layer. On the PubMed benchmark of long documents [46] it yields a Sota Rouge-2 score of 21.4, while on arXiv it has a Rouge-2 score of 19.7. On the CNN/Daily Mail benchmark of shorter documents [78] it also achieves a Sota Rouge-2 score of 21.3.

RL-175B is a summarizer for whole books by OpenAI using a reinforcement learning algorithm to follow human preferences [236]. The model first summarizes small sections of a book, then generates intermediate summaries from them, and finally produces a summary of the whole book on the basis of the intermediate summaries. The model is based on GPT-3 and is trained and evaluated on a large set of summarization tasks carried out by human labelers. The small sections are generated by a fixed chunking algorithm. Then a model is trained on human examples to summarize these chunks using reinforcement learning. It uses the approach explained in Sect. 3.6.5. A number of chunks are joined into a group and a higher-level summary is produced. This procedure is repeated until a final summary of the whole book is generated.

The fine-tuning was performed for GPT-3 models with 6B and 175B parameters. The summarization was tested on books that were not contained in the training data. The scoring is done on a Likert scale from 1 to 7, which assigns numbers to human judgments (e.g. 1 = very bad, 2 = bad, …, 7 = very good); averages are computed from these numbers. While the 6B model scores a little better than 2 Likert points, the 175B model achieves an average Likert score of 3.5. However, about 20% of the summaries received more than 5 Likert points, a rating that was also sometimes assigned to human-written summaries. It turned out that the reinforcement learning approach achieved better results than behavior cloning. In general, there is a large gap to human-created summaries, and the generated summaries still lack coherence.

6.4.3 Multi-Document Summarization

Often, information is spread across multiple documents, and it makes sense to summarize this content. For example, it may be useful to summarize a series of reviews about the same mobile phone or to summarize scientific papers on the same topic.

PRIMER [237] is based on the Longformer encoder-decoder (Sect. 3.2.1), an efficient transformer model with an input length of 4096 tokens, where the effort for processing long documents grows linearly with their length. The input documents are concatenated and separated by [doc-sep] tokens. These tokens act as global relays and have attention connections to all tokens, while the other tokens are only connected to the tokens in the same document. In this way, large sequences of input documents can be processed. It can be expected that the same information appears multiple times in the different documents. PRIMER selects sentences that are similar across different documents based on the Rouge score and uses common entities as an additional selection criterion. These sentences are masked, and the model has to reconstruct them during pre-training taking into account the information from all documents (Fig. 6.13).

Fig. 6.13
An illustration represents the interaction of 3 documents with the long former encoder, that goes through the decoder to recover the masked sentence. It indicates the layers of input, local attention, and global attention.

Multiple documents form the input for PRIMER, separated with [doc-sep] tokens. These tokens have a global attention with all tokens, the remaining tokens attend only inside each document. Some sentences are selected and have to be recovered by the decoder [237]

The pre-training already enables the model to combine the information from different documents. Therefore, zero-shot and few-shot summarization with no or little fine-tuning is possible. For the Multi-News benchmark [54] with an average document length of 1793 tokens and 2.8 documents per cluster, PRIMER achieves a zero-shot Rouge-2 score of 13.6 and can increase this to 21.1 with fine-tuning, which establishes a new Sota for this multi-document summarization benchmark. On the ArXiv benchmark with an average document length of 6021 tokens [46], the fine-tuned PRIMER yields a Rouge-2 score of 20.8, demonstrating its performance on long documents.
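
The global-attention scheme on document-separator tokens can be sketched with a generic Longformer encoder-decoder (LED) checkpoint from the Hugging Face transformers library; the separator string, the toy documents, and the use of a base LED model instead of the released PRIMER weights are assumptions for illustration only.

    import torch
    from transformers import LEDTokenizer, LEDForConditionalGeneration

    tok = LEDTokenizer.from_pretrained("allenai/led-base-16384")
    model = LEDForConditionalGeneration.from_pretrained("allenai/led-base-16384")

    # Register a document-separator token (an assumption; the base model was not trained with it).
    tok.add_special_tokens({"additional_special_tokens": ["<doc-sep>"]})
    model.resize_token_embeddings(len(tok))

    docs = ["First news article about the event ...", "Second article covering the same event ..."]
    inputs = tok(" <doc-sep> ".join(docs), return_tensors="pt")

    # Global attention on the first token and on every separator token;
    # all other tokens only attend locally within their window.
    global_mask = torch.zeros_like(inputs["input_ids"])
    global_mask[:, 0] = 1
    sep_id = tok.convert_tokens_to_ids("<doc-sep>")
    global_mask[inputs["input_ids"] == sep_id] = 1

    summary_ids = model.generate(**inputs, global_attention_mask=global_mask, max_length=64)
    print(tok.batch_decode(summary_ids, skip_special_tokens=True))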

6.4.3.1 Available Implementations

6.4.4 Summary

Foundation Models initiated a breakthrough for summarization models. They can be trained to generate abstractive summaries by handling this problem as a translation task, where the model is trained to reconstruct a reference summary. For smaller documents with up to 1000 tokens, the standard models like T5 and PEGASUS achieve good results, with BRIO being a bit ahead. Models with more parameters have a slightly better performance. General Foundation Models like PaLM have a slightly lower performance. The STIE model shows that user preferences may be used directly in training a summarizer via reinforcement learning, resulting in good summaries that are preferred by human raters.

For larger documents a transformer encoder-decoder with a larger input sequence is required, e.g. BigBird. There are different techniques to generate intermediate representations for documents, e.g. for sentences by HAT or for chunks by RL-175B. However, the quality of the summarization of whole books is currently not sufficient, even if the large GPT-3 model is employed. A recent alternative is InstructGPT (Sect. 3.6.5), which can easily be directed to perform a summarization, e.g. by the prompt “Summarize this for a second-grade student: <text>” [162, p. 30]. However, a formal evaluation of the performance of this approach seems to be difficult, as no reference training/test data is involved.

Multi-document summarization has to cope with the repetition of contents in different documents. The PRIMER model uses a hierarchical attention structure to ingest a number of large documents and is trained to reconstruct sentences exploiting information from other documents. This leads to a satisfactory performance on the specific multi-document benchmarks.

6.5 Text Generation

A system for Natural language generation (NLG) has the task of producing fluent, coherent, and understandable text. Usually, the system generates a continuation of a start text. The development of Foundation Models in recent years has greatly advanced this field and led to convincing solutions. This section concentrates on writing larger texts and complete stories. NLG has already been used for many real-world applications, such as creating business reports from business figures, describing sporting events from result tables, or creating weather forecasts. Microsoft, for instance, announced that it would lay off about 50 employees of MSN news [17] and use Deep Learning instead to identify trending news stories and optimize the content. The generation of responses to user utterances by a chatbot is discussed in the section on dialogs. A number of surveys on text generation are available [65, 83, 116]. Yu et al. [251] give an overview of knowledge-enhanced text generation.

Here we will describe story generation systems based on Foundation Models that currently provide the best results. A high-level overview of approaches is given in Table 6.10. By pre-training on a massive corpus, the models can encode a large amount of linguistic and semantic knowledge and produce rich, flexible, and universal representations of language. In the following sections we will discuss a number of different NLG tasks.

  • First, we describe NLG basics, where the next token y has to be generated according to a language model p(y|x) (Sect. 6.5.1).

    Table 6.10 Main text generation techniques
  • Then we discuss the generation of a new text with a given style, e.g. a poem (Sect. 6.5.2).

  • A related task is to rewrite one document in a different style or world view (Sect. 6.5.3).

  • In general, the text created by the Foundation Model takes a consistent but random course. The core of NLG is the task of generating text that follows a specific plot or timeline (Sect. 6.5.4).

Table 6.11 describes these tasks and lists a number of corresponding NLG models discussed in this section. The generation of fake news or other malicious text is covered in Sect. 6.5.5. Section 6.5.6 describes how to generate computer code.

Table 6.11 Mechanisms to control story generation

The assessment of the performance of natural language generators is a difficult problem. Expensive but most comprehensive is the evaluation by humans, where persons are asked to rate or compare texts generated by different NLG systems. If texts created by humans are part of the comparison, this constitutes a Turing test, which may assess the “intelligence” of an NLG system. An alternative are automatic metrics like Bleu, Meteor, or Rouge (Sect. 2.3.3), which assess the difference between machine-generated texts and human-generated reference texts by comparing n-gram counts (Sect. 6.3). A final alternative are machine learning models, which judge the adequacy of the generated text. These models act like a judge who decides whether a generated text is real or synthetic. Celikyilmaz et al. [34] discuss these evaluation approaches in detail.

GEM [66] is a new benchmark collection created for NLG, containing seventeen different benchmarks and comprising an evolving system of evaluation metrics and procedures. Some of the benchmarks are summarization tasks like XSum and MLSum, which were already covered in the previous section. Models are assessed with metrics that compare the output to a reference text and measure the diversity of the generated text. The authors provide an interactive GUI, which is able to highlight the relative strengths and weaknesses of each system. GEM can be used as a testbed to evaluate how new metrics perform on these different tasks.

6.5.1 Generating Text by Language Models

Language models (Sect. 2.2) have the task of producing the next token xt for a text x = (x1, …, xt−1). This model can directly be applied to story generation. The user provides a start text as input to the LM, which generates a continuation word by word. Specifically, the model predicts for the next position the probability p(xt|x1, …, xt−1; w) of each token of the vocabulary. To generate a text, a single sequence of tokens has to be selected according to the predicted probabilities. Simply sampling tokens according to the estimated probabilities often generates rare, implausible continuations. A better alternative is top-k or top-p sampling, restricting the random selection to the tokens with the highest probability (Sect. 2.2.3).
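
A minimal sketch of top-p (nucleus) sampling is shown below; the next-token logits over a toy vocabulary are made up.

    import torch

    def top_p_sample(logits: torch.Tensor, p: float = 0.9) -> int:
        # Keep the smallest set of tokens whose cumulative probability exceeds p,
        # renormalize, and sample from that restricted set.
        probs = torch.softmax(logits, dim=-1)
        sorted_probs, sorted_idx = torch.sort(probs, descending=True)
        cumulative = torch.cumsum(sorted_probs, dim=-1)
        keep = cumulative - sorted_probs < p      # keep tokens until the mass reaches p
        sorted_probs[~keep] = 0.0
        sorted_probs /= sorted_probs.sum()
        choice = torch.multinomial(sorted_probs, 1)
        return sorted_idx[choice].item()

    logits = torch.tensor([2.0, 1.5, 0.2, -1.0, -3.0])   # made-up logits
    print(top_p_sample(logits, p=0.9))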

Early LMs, e.g. LSTMs, produced text that often contained syntactic errors and lost the context after a few words. Variational Auto-Encoders (VAE) reconstruct the sentence from a randomly modified latent representation z ∼ N(μ, σ), where μ and σ are predicted by the encoder. A KL loss is added to the reconstruction loss such that the distribution of z approaches a standard normal distribution [89]. Generative Adversarial Networks (GAN) use a generator to transform a noise vector s into a text x̃ = G(s). Then a discriminator D(x) has the task of distinguishing synthetic text x̃ from real text x [68]. Both models are trained together. These basic language generation alternatives are also covered in Table 6.10.

A number of classical models for text generation such as BART (Sect. 3.1.3), T5 (Sect. 3.1.3), and mT5 (Sect. 3.3.2) are evaluated with the GEM benchmark [66]. The models are assessed using 7 metrics that compare the output against a reference text and 9 metrics of diversity (e.g. the relative number of distinct uni- and bigrams). Instead of reporting a single metric, the models can be evaluated with different combinations of metrics, as shown in Fig. 6.14.

Fig. 6.14
An illustration highlights t5_small in the data-to-text table. It indicates the descriptive, diverse, factual, lexical, and semantic metric groups on the right side. Below is a line graph denoting the point for t5_small.totto_val.

A screenshot of the GEM benchmark interactive result exploration tool. On the top left tasks are selected. The selection of metric-groups or metrics is on the top right. The visualization of the selected metrics is shown on the bottom. Image reprinted with kind permission of the authors [66, p. 107]

GPT-2 [174] is an autoregressive language model comprising 1.5B parameters. It was able for the first time to generate consistent stories that continue a start text. According to the users, the stories were coherent in half of the cases. Much better is the performance of GPT-3 with 175B parameters [29]. Given an initial text, it is able to create short stories, songs, press releases, technical manuals, poems, translations, guitar tabs, computer code, etc. Humans were able to distinguish synthetic news articles of about 200 words from real ones only with an accuracy close to chance (52%) [29, p. 26]. A discussion of the relative strengths and weaknesses of these Foundation Models can be found in Chap. 4.

An evaluation benchmark measuring the degree to which a language model “understands” a story is the LAMBADA benchmark [165] (Sect. 4.1.3). It consists of about 10,000 passages from the BooksCorpus containing unpublished novels. The task is to predict the missing last word of the last sentence of each passage. Examples were filtered by humans to ensure that models need to take into account the full passage of at least 50 tokens to induce the final word. The GPT-3 175B autoregressive language model [173] predicted the last word with an accuracy of 76.2% [29, p. 12]. PaLM with few-shot instructions could increase the accuracy to 89.7% [43, p. 79]. This means that in nearly nine of ten cases the predicted word was exactly correct, which indicates that the model “understood” the preceding passage well. For advanced Foundation Models like Gopher (280B) and PaLM (540B), text generation is a background ability taken for granted, which is no longer tested with benchmarks. A large battery of benchmarks is instead applied to test other features, e.g. common sense knowledge, reasoning, etc. (Sect. 4.1.4).

InstructGPT is a recent variant of GPT-3 (Sect. 3.6.5), which can easily be instructed to generate a story, e.g. by the prompt “Write a short story where a bear goes to the beach, makes friends with a seal, and then returns home.” [162, p. 6]. Retro is an autoregressive LM combined with a retrieval mechanism (Sect. 6.2.3). In this way, current and focused information can be collected during the generation of a story, instead of relying on the information contained in the model parameters, which were obtained from the training data. LaMDA (137B) is a recent Language Model (Sect. 6.6.3) specialized for dialogs. It also features a retriever-reader architecture to augment its internal knowledge acquired during pre-training with external information.

GRF [86] is a Foundation Model that includes multi-hop reasoning over a knowledge base to improve language generation. This enhances PLMs, which otherwise take common sense knowledge into account only if it is explicitly stated in the training data. The reasoning module operates on the sub-graph extended from the concepts in the input text and draws possible conclusions. These are taken into account for the further generation of text. Results, e.g. on the task of finishing a story, show that the model outperforms strong alternatives. Other approaches to enhance language models by additional knowledge are discussed in Sect. 3.4. A survey of conditional text generation is given by Guo et al. [72].

6.5.2 Generating Text with a Given Style

Often the goal is to create a text in a specific style or emphasizing a specific type of content, e.g. an author’s style (e.g. Shakespeare), emotion (e.g. angry, malicious, happy), genre (e.g. humor, romance), topic (e.g. politics, religion), persona (e.g. lawyer, knight), or sentiment (e.g. positive, negative, fury). There are a number of ways to influence the story produced by a Foundation Model:

  • Pre-training a Foundation Model with corresponding texts.

  • Adaption of the Foundation Model to a new genre/style/content by fine-tuning.

  • Specification of an initial text.

  • Few-shot instruction, e.g. for GPT-3, or simple instructions for InstructGPT.

A comprehensive survey of these approaches with Foundation Models is given by Lili and Vechtomova [122].

6.5.2.1 Style-Conditional Probabilities

CTRL [96] aims to train a generative model conditioned on a control variable a. To do this, the conditional distribution p(x|a) is learned by training on raw text sequences prefixed with context classes such as [horror], [legal], etc. The authors used text collections that are labeled with the corresponding context classes. The resulting transformer model with 1.6B parameters is then able to generate text with respect to the control prefix. This is developed further by GeDi [105], which has stronger controllability, generates less toxic text, and can be extended to continuously weighted control codes for generating fluent stories [127].

PPLM [50] (Plug and Play Language Model) defines a model p(x|a), where a denotes one or more desired controllable attributes and x the generated sample. If p(x) is the pre-trained LM, the authors define a conditional distribution p(a|x), which may be implemented by a single-layer classifier. This yields a conditional generative model p(x|a) ∝ p(a|x)p(x). The model samples from the resulting combined model by following gradients in the latent representation space (the key-value pairs of the transformer) such that both p(x) and p(a|x) are improved. After 3–10 such updates the perturbed values are used to generate the token at the next position. The model was able to create text with the desired tonality (e.g. positive/negative) while preserving fluency. However, balancing the impact of the PLM and the conditions is delicate and must be supported with additional measures like reranking and early-stopping procedures.
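
The underlying decomposition p(x|a) ∝ p(a|x)p(x) can be illustrated by directly reweighting the next-token distribution with an attribute classifier, as in the simplified sketch below; note that PPLM itself instead follows gradients in the latent space, and the distributions over the toy vocabulary are made up.

    import torch

    def conditioned_next_token(lm_logprobs: torch.Tensor,
                               attr_logprobs: torch.Tensor,
                               weight: float = 1.0) -> torch.Tensor:
        # Combine the language-model distribution p(x) with an attribute classifier
        # score p(a|x) for every candidate next token: p(x|a) ∝ p(a|x)^weight * p(x).
        combined = lm_logprobs + weight * attr_logprobs
        return torch.softmax(combined, dim=-1)

    lm_logprobs = torch.log(torch.tensor([0.5, 0.3, 0.15, 0.05]))    # made-up p(x)
    attr_logprobs = torch.log(torch.tensor([0.1, 0.6, 0.2, 0.1]))    # made-up p(a|x), e.g. "positive"
    print(conditioned_next_token(lm_logprobs, attr_logprobs))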

ETC-NLG [32] leverages context-sensitive topic models [23] to enhance PPLM with an unlabeled collection of documents. This is desirable as PPLM still requires large amounts of labeled text to effectively balance generation fluency and proper conditioning. As in PPLM, the attribute model (a discriminator predicting document topics) and the unconditional language model are merged to obtain a conditional language model for topic-conditioned utterances.

GDC (Generation with Distributional Control) [97] proposes an approach to emphasize specific words in addition to changing the distribution of generated words. For example, GDC can avoid toxic content, prevent bias, and align the generation with a particular theme or style. Instead of reweighting the generative distribution of tokens, the authors derive a stochastic policy by reinforcement learning [166] to get a good compromise between the constraints and the language model. The authors can reweight single words (e.g. China), all words in a word list (e.g. lists for kitchen or fantasy), and words emphasized by a classifier (e.g. for very negative or clickbait content). The results show that the constraints are met with the lowest divergence from the original PLM and with the best diversity scores.

Adapter-Bot [126] provides different adapters trained independently for different skills. The backbone of Adapter-Bot is a pre-trained GPT language model [262], providing the ability of text generation. A set of trainable adapters is added to the backbone and optimized over target datasets of dialogues for specific dialogue skills. Using a trained classifier to select the right dialogue skill given the dialogue history, Adapter-Bot allows high-level control over the chatbot.

6.5.2.2 Prompt-Based Generation

GPT-3 is able to produce text, when it receives an appropriate prompt (Sect. 3.6.3). It can, for instance, generate a poem [8]. After the prompt “write a poem in the style of Rabbie Burns” it may produce something like “There once was a lady from Dundee

a’ wha was bonnie, braw, and meek

She met an old man from Dunfermline

who won’t let her to her sleep …”

With the prompt “write this like an attorney” it can create a text in the wording of a lawyer. Moreover, it can automatically write emails in your personal style when given a prompt with some key points. GPT-3 can even work with unusual language types. It can, for instance, translate natural language into shell commands or programming code [163]. More prompts for GPT-3 and other Foundation Models are provided by OpenAI [160]. InstructGPT was fine-tuned to generate text according to an instruction (Sect. 3.6.5). It can, for instance, receive the directives “Complete the following sentence in a polite, respectful, and unbiased manner:” or “Complete the following sentence using maximally biased and offensive language:”. The model then produces diverse texts that satisfy the requirements [162].

6.5.3 Transferring a Document to Another Text Style

Text style transfer aims to translate a text x′ with attribute a′ into a similar text x with a desired attribute a. For example, the sentence x′ = “Peter screwed up” with the attribute a′ = “informal” can be transformed to x = “Peter has not reached the goal” with the attribute a = “formal”. The aim is to train a language model p(x|x′, a). There are a number of other transformations, such as impolite ↔ polite, complicated ↔ simple, positive ↔ negative, biased ↔ neutral, or factual ↔ humorous ↔ romantic.

The separation of style from content is difficult. On the one hand it can be captured by linguistic features, e.g. the utilization of specific words and phrases. On the other hand, it can be provided by text collections, e.g. with the writings of different authors or with a corpus of positive/negative reviews. In the latter case we can train classifiers, which discriminate between the different styles. With the recent progress in the capabilities of language models there are a number of successful applications of style transfer like imitating the style of specific authors, removing bias in online text, etc. A recent comprehensive survey is provided by Jin et al. [88].

6.5.3.1 Style Transfer with Parallel Data

If there are parallel documents of both styles, the style transfer can be formulated as a translation problem. An encoder-decoder transformer has to be fine-tuned on this dataset.

Formal [260] formulate style transfer from informal to formal as a translation task. They use a transformer as Seq2seq model and apply it to the GYAFC [180] benchmark dataset containing parallel formal/informal sentences. In addition, they augment the data by back-translation, employ machine translation to and from another language and leverage training data from grammatical error correction. They report a new Sota on the GYAFC dataset with increased formality and fluency, while keeping the meaning of a text.

6.5.3.2 Style Transfer without Parallel Data

StyleLM [217] translates an arbitrary text into a text with the style properties of another author while keeping the content, even if no parallel data with the same content in different styles is available. First, a BERT model is trained on a large neutral corpus (Gutenberg and Wikipedia) with the MLM loss. Then two copies of the model are used as an encoder-decoder transformer x̃ = Dec_w(Enc_u(x)). As fine-tuning input, this Seq2seq model receives texts from the target author, where a random fraction of the words have been masked and have to be reconstructed. Hence, the Seq2seq model produces text in the target author’s style while rewriting the input text.

For evaluation, 10 different authors were selected and excluded from the training data. The Bleu and Rouge scores are used to measure content preservation. To measure style quantitatively, the frequency of author-specific words and of syntactic and punctuation elements is evaluated. StyleLM in most cases had the best content preservation and stylistic alignment. Singh et al. [207] note that StyleLM has problems with content reproduction. They propose to pre-train the encoder-decoder Dec_w(Enc_u(x)) on a large generic corpus. Afterwards the encoder-decoder is fine-tuned on the text of the target author.

OPTIMUS [115] investigates further manipulations of sentence embeddings. An encoder with parameters u generates a latent vector z = Enc_u(x) from a text x. It is initialized with a pre-trained BERT model. A linearly transformed version z = W·h_[CLS] of the embedding of the first token [CLS] of a sentence is defined as the latent representation. The generator (decoder) with parameters w generates the text sequence x = Dec_w(z) from a random vector z (e.g. multivariate Gaussian) with prior p(z). The authors start with a pre-trained GPT-2 model as decoder. z is used by the decoder as an additional vector to attend to (in addition to the previously generated token embeddings). Both networks, combined as x̃ = Dec_w(Enc_u(x)), are trained with the autoencoder loss and the variational autoencoder loss, i.e. the system has to minimize |x̃ − x| and encourage a Gaussian distribution for z.

The approach learns bidirectional mappings between latent embeddings z and sentences x. For two sentences x1 and x2, the embeddings z1 and z2 may be calculated, and with αz1 + (1 − α)z2 we can continuously interpolate between the sentences. In addition, differences between latent vectors may be computed, similar to Word2Vec. For dialog response generation and the generation of responses with a specific style, OPTIMUS has a better performance on all metrics compared to its competitors. Using an additional GAN to manipulate the latent representation z, OPTIMUS is able to generate YELP restaurant reviews of prescribed sentiment (positive/negative) better than the investigated alternatives. The authors argue that, compared to BERT, OPTIMUS learns a more structured semantic space due to the use of the VAE prior distribution in training.
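
The interpolation itself is a simple linear operation in the latent space, as in the following sketch with made-up latent vectors; in OPTIMUS each intermediate vector would be decoded into a sentence.

    import numpy as np

    def interpolate(z1: np.ndarray, z2: np.ndarray, steps: int = 5):
        # Linear interpolation alpha*z1 + (1-alpha)*z2 between two latent codes.
        return [alpha * z1 + (1.0 - alpha) * z2 for alpha in np.linspace(0.0, 1.0, steps)]

    # Made-up latent vectors standing in for z1 = Enc_u(x1) and z2 = Enc_u(x2).
    rng = np.random.default_rng(0)
    z1, z2 = rng.normal(size=32), rng.normal(size=32)

    for z in interpolate(z1, z2):
        # In OPTIMUS, each intermediate z would be decoded: sentence = Dec_w(z)
        print(np.round(z[:3], 2))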

6.5.3.3 Style Transfer with Few-Shot Prompts

Sufficiently large Foundation Models such as GPT-3, Gopher, and PaLM can perform various tasks simply by choosing a clever prompt. If, however, only a simple prompt is entered, e.g. “Here is some text: {That is an ugly dress}. Here is a rewrite of the text, which is more positive: {”, the model often fails and may not produce well-formatted or consistent outputs. The AugZero [182] prompting schema employs augmented zero-shot prompts, which provide several demonstrations of sentences being rewritten to a new style. An example is shown in Fig. 6.15. In contrast to few-shot prompting, where the examples have to cover the exact target task, the model can also generalize to other unseen types of styles, e.g. “comic” in the example.

Fig. 6.15

Augmented zero-shot prompts can instruct large autoregressive LMs like GPT-3 to transfer a text to a new style. This even works, if there is no example given for the specific style desired, e.g. “comic” in the example [182, p. 2]
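The structure of such an augmented zero-shot prompt can be reproduced with simple string formatting. The following sketch mirrors the pattern reported in [182]; the demonstration sentences and rewrites are invented for illustration and are not the original prompts.

    DEMOS = [
        ("I like pizza.", "more descriptive",
         "The warm, cheesy pizza filled the room with an irresistible aroma."),
        ("The meeting was long.", "more melodramatic",
         "The meeting dragged on endlessly, as if time itself had stopped."),
    ]

    def augmented_zero_shot_prompt(sentence, target_style):
        """Build a prompt with style-rewriting demonstrations that need not
        contain the target style; the model is expected to generalize."""
        parts = []
        for src, style, rewrite in DEMOS:
            parts.append("Here is some text: {" + src + "}. "
                         "Here is a rewrite of the text, which is " + style + ": {" + rewrite + "}")
        parts.append("Here is some text: {" + sentence + "}. "
                     "Here is a rewrite of the text, which is " + target_style + ": {")
        return "\n".join(parts)

    print(augmented_zero_shot_prompt("That is an ugly dress", "more comic"))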

The authors use GPT-3 with 175B parameters. Professional human raters were asked to assess text style, content preservation, and fluency. The zero-shot alternative performed worst and did not return a valid response in a quarter of the cases. It turned out that AugZero was rated comparably to human-written ground truth. Evidently, the language model can extrapolate from the examples and transform a text into unseen styles. Adding the target attribute to the augmented prompts yielded a very similar performance. It can be expected that larger models like PaLM and LaMDA can generate even better results (Sect. 3.6.5).

Buchanan et al. [30] noted that they could not instruct GPT-3 by a single prompt to express a given story in a new tone or slant, supporting the above finding. Therefore, they developed a two-step procedure: First, GPT-3 was instructed by a few-shot prompt to summarize the given story into a list of bullet points. In a second step GPT-3 was instructed by prompts such as “Write a strongly pro-Trump article about [Topic X] that makes use of the following list of facts about [Topic X]”. When human evaluators examined 20 generated stories, 11 of them were identified by at least one person as being “definitely authentic”. The authors used GPT-3 to solve further tasks, e.g. creating new narratives that could form the basis of conspiracy theories (e.g. QAnon), convincing members of particular groups to believe a claim, or persuading persons to change their opinion on some topic. They come to the conclusion that systems like GPT-3 are well-suited for generating a story with a new slant, e.g. for disinformation. This is even more alarming for more efficient recent Foundation Models like LaMDA or PaLM.

6.5.4 Story Generation with a Given Plot

A narrative, story or tale is a description of a series of related events or experiences [234]. As the story generated by a PLM gets longer, often the earlier context is forgotten, and the text develops in an aimless fashion. Therefore, researchers would like to prepare a rough plot or storyline for the story, which is then taken into account by the Foundation Model. More specifically the story structure, the story ending, the general topic, or the persona of leading characters can be controlled. Besides story generation another application is data-to-text generation, where non-linguistic structured data (e.g., a table or a graph) is converted to natural language text, which can be applied in tasks like healthcare, weather forecast, legal text, etc. Surveys of controlled text generation are provided by Prabhumoye et al. [170], Yu et al. [251], and Zhang et al. [257].

The planned course of a story can be described in different ways:

  • A list of single keywords or phrases.

  • A list of sentences or bullet points describing an event.

  • An event graph describing the logical dependency of events.

6.5.4.1 Specify a Storyline by Keywords or Phrases

Megatron-CNTRL [243] controls the story generation by keywords. In addition, it dynamically incorporates external knowledge from the ConceptNet knowledge base into the language model during generation. From the current story context a keyword predictor first predicts a set of keywords for the next sentence. The retriever collects knowledge from the KB corresponding to the keywords. The returned sentences are re-ranked according to their relevance to the story context. Finally, the generator takes the story context and the top-ranked retrieved sentences and produces the next sentence. To support generalization of entities they replace names and entities in stories with special placeholders, [MALE], [FEMALE], and [NEUTRAL] for male, female and unknown names and entities, respectively. The underlying Megatron model (Sect. 3.1.2) has up to 8B parameters. Experiments show that the model generates more fluent, consistent, and coherent stories with a lower repetition rate and higher diversity compared to the previous Sota.
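The generation loop of such a knowledge-grounded system can be summarized as follows. This is a strongly simplified sketch; predict_keywords, retrieve_from_kb, rerank, and generate_sentence are hypothetical stand-ins for the trained components of Megatron-CNTRL.

    def generate_story(context, n_sentences, predict_keywords, retrieve_from_kb,
                       rerank, generate_sentence, top_k=5):
        """Extend the story context sentence by sentence with retrieved knowledge."""
        story = list(context)
        for _ in range(n_sentences):
            keywords = predict_keywords(story)          # keywords for the next sentence
            facts = retrieve_from_kb(keywords)          # sentences from the knowledge base
            best_facts = rerank(facts, story)[:top_k]   # rank by relevance to the story context
            story.append(generate_sentence(story, best_facts))
        return story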

Dong et al. [52] present a model, which takes as input a list of keywords with attached entity classes and generates a text containing these keywords. The entities are taken into account during text generation and the model embeds the meaning of entities into hidden states. The results show that the generated sentences are able to reflect the properties of the entities.

PlotMachines [181] generates a text based on a plot consisting of a set of phrases. The system can decide for itself in what order to introduce the concepts covered by the phrases. It is based on the GPT and GPT-2 language models. The authors use three different datasets covering TV shows, movies, books, short stories, and news articles. They extract phrases (3–8 words) from these stories by a keyword extraction method [167]. Given an outline as input, the model recurrently generates paragraphs (Fig. 6.16). To create the next paragraph it uses a gating mechanism similar to an LSTM gate, which updates a memory matrix M that keeps track of plot elements of the outline. The self-attention in the model is adapted to receive input from the memory matrix as well as the previously generated words. According to automatic metrics (Rouge, Bleu) the model has a better ability to generate realistic-looking as well as diverse texts than its competitors. In extensive experiments with human raters the authors demonstrate that their model produces text closer to the plot than alternative models.

Fig. 6.16

An outline (input) together with a story (output) from the Wikiplots training set generated by PlotMachines. Plot elements from the outline can appear and reappear nonlinearly throughout the plot, as shown in plot dynamics graph. A memory matrix keeps track of how outline phrases have been used while writing. Image reprinted with kind permission of the authors [181, p. 1]
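The gated memory update can be sketched with numpy as follows. This is a simplified illustration of an LSTM-like gate blending the old memory matrix with a candidate update computed from the current paragraph representation; the actual PlotMachines update uses learned projections inside the transformer.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def update_memory(M, h_par, W_g, U_g, W_c, U_c):
        """Gated update of the memory matrix M (one row per plot element).
        M: (n_slots, d), h_par: (d,) paragraph representation,
        W_*, U_*: (d, d) projection matrices (random in this toy example)."""
        gate = sigmoid(M @ W_g + h_par @ U_g)        # how much each slot may change
        candidate = np.tanh(M @ W_c + h_par @ U_c)   # proposed new slot contents
        return (1.0 - gate) * M + gate * candidate

    rng = np.random.default_rng(0)
    d, n_slots = 8, 4
    M = rng.normal(size=(n_slots, d))
    h = rng.normal(size=(d,))
    W_g, U_g, W_c, U_c = (rng.normal(size=(d, d)) for _ in range(4))
    M_new = update_memory(M, h, W_g, U_g, W_c, U_c)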

Pointer [261] inserts new words between the words of a given start set. Based on the start set, the model first generates high-level words (e.g. verbs and adjectives) that provide a high-level connection. Then it iteratively inserts words of finer granularity around the keywords until the whole sentence is generated. The training objective of Pointer is to generate a complete text sequence with a set of keywords as constraints. This is similar to the masked language modeling (MLM) objective in BERT, so a pre-trained BERT is used to initialize the model training. An insertion transformer [210] is used to generate either a regular token or a special token for each gap between two existing tokens. Empirical evaluations demonstrate the effectiveness of the approach. Similar models are ProGeT proposed by Tan et al. [220] and the constrained BART [77].

ProGen [219] generates a story at k different levels. For each level a vocabulary \(\mathcal {V}_i\) is defined based on the tf-idf score, such that \(\mathcal {V}_1\) contains high-information words while \(\mathcal {V}_k\) contains all words. k different encoder-decoder models (BART) \(M_i\) are trained for the k levels, where the i-th level employs the training data \(X_i\) containing only words from vocabulary \(\mathcal {V}_i\). As input \(M_i\) receives the training data \(X_{i-1}\) from the previous level and has to predict the refined version \(X_i\). Note that usually the input words from \(X_{i-1}\) will be included in the next output. A storyline now can be formulated by a human using words from a high-level vocabulary, which covers about 15% of all content. If, for example, the first-stage text is “beckham ∖n liverpool bayern chelsea ∖n beckham chelsea mancini …” the final-stage text starts as “England striker Ashley Beckham has joined Premier League strugglers Newcastle United. ∖n England Football …”. Evaluation shows that the coherence of the texts over long intervals (36 sentences) is close to that of humans and much better than for a basic BART model. In addition, ProGen has favorable properties with respect to fluency, lexical and semantic quality, as well as diversity.
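The construction of the level vocabularies by information content can be illustrated with plain Python. The sketch below ranks words by a simple tf-idf-style score over a toy corpus and builds nested vocabularies; tokenization, scoring, and the vocabulary fractions are illustrative and not those used by ProGen.

    import math
    from collections import Counter

    def level_vocabularies(documents, fractions=(0.15, 0.5, 1.0)):
        """Build nested vocabularies V_1 ⊆ ... ⊆ V_k ranked by a tf-idf score."""
        docs = [doc.lower().split() for doc in documents]
        tf = Counter(w for doc in docs for w in doc)            # term frequency
        df = Counter(w for doc in docs for w in set(doc))       # document frequency
        n_docs = len(docs)
        score = {w: tf[w] * math.log((1 + n_docs) / (1 + df[w])) for w in tf}
        ranked = sorted(score, key=score.get, reverse=True)
        return [set(ranked[: max(1, int(len(ranked) * frac))]) for frac in fractions]

    docs = ["beckham joined chelsea", "beckham scored for england",
            "the weather in london was rainy"]
    V1, V2, V3 = level_vocabularies(docs)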

6.5.4.2 Specify a Storyline by Sentences

Facts2Story [161] receives as input a sequence of key facts expressed in natural language and generates a story containing the facts in the given order (Table 6.12). These facts are simple sentences that describe factual information of the story. Each fact should report an event in the story, state the properties of a person or a place, mention the emotions of characters, etc. There should be a large degree of freedom to generate a story containing the facts.

Table 6.12 Story generated by Facts2story model with facts as input [161]. Words taken from the facts are printed in italics

To keep the problem manageable, the authors give an input of 5 ordered facts and aim to generate a coherent story of 100–1000 words covering all facts in order. As training data 17k story plots from Wikipedia were used. From each of these plots facts were extracted by the SalIE framework [169]. The five facts with the highest saliency scores were selected.

Since standard language models (GPT-2, BART) diverge from the input after a number of generated tokens and focus on the newly generated content, the authors use a pre-trained XLNET (Sect. 3.1.1), which is able to take future words into account. The assumption is that the words of the facts should appear in the final text in the given order. XLNET is able to process these tokens in random order, because the position embeddings are attached to the token embeddings. As other words may occur between two consecutive tokens of the facts, a model is trained to predict the number of intervening words. This model is used to determine the exact position of each word of each fact. Finally, the XLNET has to fill in the missing words.

The generated stories are evaluated by humans according to three criteria: (1) adherence to facts, (2) grammatical correctness, (3) common sense and plausibility of events. Alternatives investigated were GPT-2 (Sect. 2.2.4) with additional self-attention [269] and the Seq2seq model BART (Sect. 3.1.3), which is pre-trained to recover randomly shuffled text and fine-tuned to generate the story using the facts as input. The evaluation shows that Facts2Story generates a story containing on average 4.4 of the 5 facts, while the other models recover less than 1.7 facts. With respect to grammar and common sense Facts2Story fares slightly worse than GPT-2 but much better than BART.

SOE (Summarize, Outline and Elaborate) [214] starts from the observation that most approaches for story generation produce texts in a word-by-word manner and have no high-level plan on what to generate. To address this issue, a coarse-to-fine generation strategy with two levels is proposed. For each segment \(y_i\) of the text a summary \(s_i\) is provided. The model first generates “bullet points” for each summary. Subsequently, the model expands each bullet point to generate the corresponding segment. Note that during this process the high-level discourse dependencies are preserved.

To prepare the training data, the stories in a collection are partitioned into segments of several hundred words using BERT next-sentence prediction to measure the degree of dependency between sentences. For each segment an extractive summary is generated using BERT and TextRank [144]. Then a transformer is employed to create the bullet points conditioned on the previous bullet points. From these the final text is produced, taking into account previous text and abstractions. WikiText 103 [142] and the BookCorpus [267] were used as training data.

The performance of the model was evaluated with respect to fluency by perplexity, with respect to text diversity by the number of distinct n-grams, text acceptability as measured by an adversarial classifier, and sentence level coherence measured by a next-sentence prediction score. On all scores the SOE-model with an additional reranking procedure achieved the best results. Comparison with Transformer-XL [49] and Progressive WritingPrompts [220] demonstrated the superiority of SOE with respect to perplexity, diversity of the generated text, and coherence.

FIST [58] receives a sequence of “events” as input, describing each paragraph (Fig. 6.17). To extract events from paragraphs for training, keyword extraction techniques [144, 191] are used. By means of special tokens as delimiters these events are connected with paragraphs in an interleaving manner. The authors fine-tune a pre-trained GPT-2 with the LM loss on the augmented sequences to learn the functionality of the special tokens and the co-occurrence structures between events and stories. The performance of FIST is compared with PlotMachines (see above) and two other approaches on two benchmark datasets. With respect to most evaluation measures FIST generally achieves better results. The Sota in story generation is developing fast, with new techniques appearing every month. We describe some limitations of current models in the context of dialogs in Sect. 6.6.4 and discuss some remedies.

Fig. 6.17

Story generated by the FIST model with prompt and event as input [58]
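The construction of the interleaved training sequences can be sketched as follows; the special-token names are illustrative placeholders, not necessarily the delimiters used in [58].

    EVENT_TOK, STORY_TOK, END_TOK = "<|event|>", "<|story|>", "<|endoftext|>"

    def build_fist_sequence(events, paragraphs):
        """Interleave per-paragraph events with paragraphs for LM fine-tuning."""
        assert len(events) == len(paragraphs)
        parts = [f"{EVENT_TOK} {ev} {STORY_TOK} {par}"
                 for ev, par in zip(events, paragraphs)]
        return " ".join(parts) + f" {END_TOK}"

    seq = build_fist_sequence(
        ["knight finds dragon", "dragon befriends knight"],
        ["The knight stumbled upon a sleeping dragon in the hills.",
         "To his surprise, the dragon woke up and offered him tea."])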

Papalampidi et al. [164] note that in generated stories the appearing entities are often incoherent, i.e. persons are replaced and locations change. The MNEMELM model employs an additional entity memory, where the generated entities and their attributes are stored dynamically and retrieved during further story generation. The representation of an entity is the average embedding of the tokens of the entity. Each entity memory slot \(m_j\) thus contains a fixed surface entity representation (writing) \(k_j\) and a dynamic value \(v_j\), which is frequently updated based on each new chunk of the narrative context. The stored entities enter the self-attention computations and thus influence the story.
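The slot structure of such an entity memory can be illustrated with a small numpy sketch. The embed function is a hypothetical token-embedding lookup; the real MNEMELM update is learned and integrated into the self-attention layers.

    import numpy as np

    class EntityMemory:
        """Key = fixed average embedding of the entity tokens, value = dynamic state."""
        def __init__(self):
            self.names, self.keys, self.values = [], [], []

        def add_entity(self, name, embed):
            key = np.mean([embed(tok) for tok in name.split()], axis=0)
            self.names.append(name)
            self.keys.append(key)
            self.values.append(key.copy())          # value initialized from the key

        def update(self, name, chunk_repr, rate=0.5):
            """Blend the stored value with the representation of a new narrative chunk."""
            i = self.names.index(name)
            self.values[i] = (1.0 - rate) * self.values[i] + rate * chunk_repr

        def read(self, query):
            """Attention-like read: weight values by the similarity of keys to the query."""
            weights = np.exp(np.stack(self.keys) @ query)
            weights /= weights.sum()
            return weights @ np.stack(self.values)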

As background model a Transformer-XL (∼300M parameters) pre-trained on a translation task is used (Sect. 3.2.2). On the WikiPlot and the WritingPrompts benchmarks it turns out that MNEMELM imitates the human frequency of entity usage better than other models and in addition has a higher entity coherence and consistency. This is also confirmed by human judgment. Recently, dynamic retrieval-based approaches were also used by dialog systems such as BlenderBot-2 (Sect. 6.6.2). By combining these approaches the generation of stories may be improved.

We have seen above (Sect. 6.5.3) that GPT-3 can rewrite a story with a new slant when prompts are used in a two-step procedure [30]. First, GPT-3 was instructed to summarize the given story into a list of bullet points. In a second step GPT-3 was instructed by prompts to write a story with a given tone containing the facts noted in the bullet points. If only the second step is executed, GPT-3 can be instructed to write a story covering the bullet points and in addition obeying the prescribed slant. Currently, we are not aware of a systematic evaluation of the effectiveness of this technique, which should be even more rewarding for larger Foundation Models.

6.5.4.3 Other Control Strategies

GraphPlan [38] aims to prevent logical inconsistencies in generated text, which are often produced by models like GPT-2. The input to the model is an event graph, which represents each event by a verb phrase. To prepare training data, the verb phrases of events are extracted from a story using semantic role labeling and characterized by Latent Dirichlet Allocation topics [23]. The events are connected by directed edges indicating possible next events. In addition, event pairs are identified that are mutually exclusive. To generate a story, first a sequence of events is selected based on a beam search (Sect. 2.3.2). Subsequently, the text is generated by a version of GPT-2. With extensive experiments the authors found that GraphPlan generates stories that are less repetitive and more consistent. Koncel-Kedziorski et al. [104] present a similar model to generate text from knowledge graphs with graph transformers. Using another method based on BART and T5, it is possible to generate fluent stories from graphs representing the story structure [185].

Sakaguchi et al. [196] present an approach based on the T5 transformer with 11B parameters that generates a directed acyclic graph of events describing a story. The order of events indicates their logical and temporal dependency. This graph may be taken as an input to another Foundation Model to generate a story containing the events of the script.

CAST [168] aims to improve the coherence of the generated story and of the actions of its characters. It infers the causal relations between events as well as the intents and motivations of characters in the story context, and uses them to steer the generation of a coherent story. The authors employ a logical inference model to reason about the characters in the story and to influence the generated words. As basic model they use GPT-2 and generate stories involving two persons. Their experiments show that the produced stories are more coherent and stay on topic.

6.5.5 Generating Fake News

The creation of Fake News can simply be considered as the task of generating stories with a new slant. Buchanan et al. [30] investigated how GPT-3 can be used to generate large numbers of different fake news messages that can be easily distributed to thousands of users. They mainly formulate appropriate prompts for GPT-3 (Sect. 3.6.3) to produce the desired texts. This comprises variations of tweet-like short messages, medium-sized posts expressing a world view, and longer articles reporting an event from a particular perspective. Examples are shown in Fig. 6.18.

Fig. 6.18

Some of the fake news generation tasks performed with GPT-3 [30]

Narrative Reiteration aims at creating a large number of short messages (e.g. tweets) that express a particular theme, such as climate change denial. The authors collected replies with many likes from a climate change denial account. Ten of these messages were used as input prompt to GPT-3, e.g.: “TWEET 4: Soros/Gates Funded $6.5 million to group now warning world may need ‘climate lockdown”’. GPT-3 continued with similar tweets such as “TWEET 14: Climate change is the new communism - an ideology based on a false science that cannot be questioned.” Obviously, GPT-3 produces very good results with little human assistance.

Narrative Elaboration intends to justify a claim with a medium-length story. The authors accomplished this in a two-step process. First, GPT-3 is instructed to generate a series of headlines that each make some new assertion regarding a certain topic. This was done by collecting five headlines from a far-right media company, e.g. “HEADLINE 5: Chinese Official Praises Quality of Country’s Vaccines, Despite Multiple Health Scandals” [30, p. 9]. GPT-3 then generated five new headlines, e.g. “HEADLINE 6: Secret Chinese Vaccine Testing on Half a Million Children Confirmed”. Subsequently, GPT-3 was given these generated headlines to create longer articles. A headline together with a created article is shown in Fig. 6.19. It turned out that GPT-3 was able to capture the appropriate tone and tendency of the fake news source, as demonstrated by a classifier. Note that GPT-3 now can be fine-tuned (Sect. 3.6.2) and thus concentrate even better on the content and the reasoning of specific news sources.

Fig. 6.19

A sample headline from The Epoch Times and the beginning of the article generated by GPT-3 [30, p. 11]

Narrative Reframing is necessary if an article contains new arguments against a worldview. Then a new chain of arguments has to be generated that allows the worldview to be upheld. The authors found a two-step approach for this task. First GPT-3 has to summarize the original article in a list of bullet points. Then GPT-3 is asked to generate a new article from a particular viewpoint, e.g.: “write a strongly pro-Trump article about [Topic X] that makes use of the following list of facts about [Topic X]”. The researchers took advantage of the fact that GPT-3 not only interprets the prompt provided by the human as an example, but also learns something about the specific boundary conditions of the task from this example. An evaluation by human raters showed that 8 of 20 GPT-3 stories were judged as likely authentic by three of nine evaluators. The results suggest that GPT-3 can meaningfully shift the slant of a news story.

In addition, the authors evaluated GPT-3 for other tasks. GPT-3 was able to develop new conspiracy theories in the style of QAnon. It was not tested whether these theories could convince followers. Often the target is to strengthen an attitude or induce a specific behavior (e.g. voting) of members of groups with particular social characteristics (e.g. race, religion). A human team with GPT-3 support is able to create credible targeted messages in just minutes. GPT-3 uses stereotypes and racist language in its texts, a tendency that is particularly worrying. Finally, a human-machine team is able to develop messages on two international issues (withdrawal from Afghanistan and sanctions against China) that cause survey respondents to change their positions. After seeing five short messages written by GPT-3 and selected by humans, the number of survey respondents who opposed sanctions against China doubled.

The study shows that there is a real chance that automated tools will generate content for disinformation campaigns. It recommends focusing on the infrastructure used to disseminate campaign messages, such as fake accounts on social media, rather than determining the authorship of the text itself, as it is difficult to detect content fabricated by GPT-3. This is even more urgent because GPT-3 can now be fine-tuned to perform specific tasks (Sect. 3.6.2) and the InstructGPT version can be easily instructed to execute specific assignments (Sect. 3.6.5).

6.5.5.1 Detecting Fake News

Fake news is false or misleading information presented as news in the media and on the Internet, especially in social media. Fake news is a global phenomenon. According to Khan et al. [98], nearly 50% of the traffic on Facebook is fake or hyperpartisan. Since fake news aims to imitate real news, detecting fake news is generally not possible by analyzing the text alone. Monti et al. [148] showed that content, social context or news propagation in isolation is insufficient for neural models to detect fake news. Fake news detection is difficult because it is an adversarial game, in which fake news producers react to new detection methods.

There are a large number of benchmark datasets [47], which, however, are somewhat outdated. It is possible to achieve a high accuracy on these datasets, e.g. 94.1% on the Fake News Challenge FNC-1 [201] or 98.5% on Covid-19 fake news detection [117]. Ansar et al. [9] provide a survey on the characterization of fake news and methods for detecting it. They divide the detection of fake news into the analysis of the news content, the analysis of the source and its reliability and the analysis of the social reaction to an article. Other surveys on fake news detection are available [85, 98, 172]. An overview over multimodal disinformation detection, e.g. with text and images, is given by Alam et al. [6].

Gupta et al. [74] propose a knowledge-oriented framework that supports news verification by using trusted sources as context. They extract key information such as frequent words and entities from news articles and use them to query trusted sources for related articles. They calculate a similarity score between the news article and the retrieved articles based on distributed embeddings and the Word Movers Distance [108]. Then they compare the similarity score to a preset threshold to determine whether the article is semantically similar to the trusted news or not.
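The final decision step can be sketched as a cosine-similarity comparison over document embeddings. The embeddings and the threshold are placeholders; Gupta et al. additionally use the Word Movers Distance, which is omitted here.

    import numpy as np

    def is_supported_by_trusted_sources(article_emb, trusted_embs, threshold=0.75):
        """Return (decision, scores): the article counts as supported if it is
        semantically close to at least one retrieved trusted article."""
        def cosine(a, b):
            return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
        scores = [cosine(article_emb, t) for t in trusted_embs]
        return (max(scores, default=0.0) >= threshold), scores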

The detection of text generated by advanced language models like GPT-3 has been investigated by Fröhling et al. [60]. They conduct a number of experiments on data generated by different language models, such as GPT-2 with different parameter counts, Grover [255], and GPT-3 with 175B parameters. It turns out that classifiers are able to identify the linguistic peculiarities of a single language model with a good accuracy of 70–90%. However, when another language model has generated the text, the accuracy drops and reaches only about 30–50%. The authors conclude that it might be impossible to account for these differences in one single classifier, and propose other solutions like dedicated classifiers.
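A minimal baseline of this kind, a classifier over surface features, can be set up with scikit-learn as sketched below. This is not the feature set of Fröhling et al.; it is merely a word n-gram baseline illustrating the approach, with human_texts and machine_texts assumed to be lists of strings.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline

    def train_detector(human_texts, machine_texts):
        X = human_texts + machine_texts
        y = [0] * len(human_texts) + [1] * len(machine_texts)   # 1 = machine-generated
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
        clf = make_pipeline(
            TfidfVectorizer(ngram_range=(1, 2), max_features=50000),
            LogisticRegression(max_iter=1000))
        clf.fit(X_tr, y_tr)
        print("held-out accuracy:", clf.score(X_te, y_te))
        return clf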

Sepúlveda-Torres et al. [201] introduce a method to detect dissonance between the headline and the body of a news article. This is especially useful, when considering that most users do not read the body of news articles on social media, but rather form an opinion based on the headline. A summary of the article is generated and compared to the headline using a RoBERTa model. On a Fake News Challenge FNC-1 dataset the model achieves a new Sota with 94.1% accuracy.
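The core check, whether the article summary is consistent with the headline, can be approximated with an off-the-shelf natural language inference model, as sketched below. This is not the model of Sepúlveda-Torres et al.; it only illustrates the idea with the publicly available roberta-large-mnli checkpoint from Hugging Face.

    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("roberta-large-mnli")
    model = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")

    def headline_consistency(summary, headline):
        """Return the probabilities that the summary entails, is neutral to,
        or contradicts the headline."""
        inputs = tokenizer(summary, headline, return_tensors="pt", truncation=True)
        with torch.no_grad():
            probs = model(**inputs).logits.softmax(dim=-1).squeeze(0)
        return {model.config.id2label[i]: float(p) for i, p in enumerate(probs)}

    print(headline_consistency(
        "The company reported record profits in the last quarter.",
        "Company on the brink of bankruptcy"))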

Alizadeh et al. [7] describe the practical application of a system analyzing publicly available Twitter data by Chinese, Russian, and Venezuelan trolls targeting the United States, as well as the Reddit dataset of Russian influence efforts. They report that content-based features perform well across period, country, platform, and prediction task.

As a new feature, the reliability of news publishers and disseminators can be taken into account for fake news detection. This means that a news story originating from a source with high reputation is more credible. SMAN [252] is a PLM-based model which combines the news content with the publishing and reposting relations of publishers and users to jointly optimize the fake news detection and credibility prediction tasks. While the text of a story can be adapted to evade new detection algorithms, it is not possible for the faker to change the network of publishers. The authors performed experiments on three real-world datasets. They considered messaging datasets with a time stamp and in this way could emulate detection over time. The results show that SMAN can detect fake news within 4 h with an accuracy of over 91%, which is much faster than the state-of-the-art models.

Fake news can jointly contain text and images. Therefore, image analysis techniques discussed in Sect. 7.2 can be employed. An advanced solution is discussed in [208], and a challenge including hateful image-text content is described by Kiela et al. [100].

6.5.6 Generating Computer Code

The training data of Foundation Models contains a lot of computer code, e.g. 39B code tokens for PaLM [43, p. 22]. Foundation Models handle code in the same way as they process words: they simply generate the next statement given the previous words. PaLM considers two tasks in connection with code [43, p. 21]: Text-to-code aims to write code given a natural language description. Code-to-code involves the translation of C++ programs to Python. For evaluation, the percentage of generated code samples that solve the task is reported.

Different benchmarks were employed for evaluation. In the HumanEval [39] and MBPP [14] benchmarks, the model is given an English description consisting of a few sentences and a small number of input-output examples, and the goal is to generate a short Python program, usually a single function. More demanding is the GSM8K-Python task derived from the GSM8K benchmark [45]. The mathematics word problems in GSM8K are converted into the task of producing a Python program that returns a correct solution. Four problems manually converted to Python programs were used as few-shot exemplars.
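To make the task formats concrete, the snippet below shows an invented HumanEval-style problem (a signature and docstring with examples, to be completed by the model) and a GSM8K-Python-style solution program; neither example is taken from the actual benchmarks.

    # Text-to-code, HumanEval/MBPP style: the model receives the signature and
    # docstring and must generate the function body.
    def running_maximum(numbers):
        """Return a list with the running maximum of the input list.
        >>> running_maximum([1, 3, 2, 5, 4])
        [1, 3, 3, 5, 5]
        """
        result, current = [], float("-inf")
        for x in numbers:
            current = max(current, x)
            result.append(current)
        return result

    # GSM8K-Python style: a word problem ("A baker bakes 12 trays with 8 rolls
    # each and sells 60 rolls. How many rolls are left?") is answered by a
    # short program returning the numeric solution.
    def solution():
        rolls = 12 * 8
        return rolls - 60

    print(solution())   # 36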

For the HumanEval and MBPP benchmarks the pre-trained PaLM540B was able to generate a Python program that implemented the correct solution 76.2% and 75.0% of the cases, respectively. A PaLM540B version fine-tuned on additional Python-text data is called PaLM-Coder. For this model, performance on HumanEval and MBPP was increased to 88.4% and 80.8% respectively, where the first result is Sota. The mathematics word problems in the GSM8K-Python data were correctly solved by PaLM540B in 51.3% of the cases, which again is Sota. Note that the solution of mathematical text problems is also a big hurdle for many students. A systematic evaluation of Foundation Models of code is provided by Xu et al. [240].

There are a number of other programming applications. In a GPT-3 based layout generator, for example, users just enter a short text describing a layout “the google logo, a search box, 2 lightgrey buttons that say ‘Search Google’ and ‘I’m feeling Lucky’ with padding in-between them” and the system creates a program for this website [59]. A more advanced system is the GPT-3 based GitHub Copilot [157]. Initial reactions are mostly positive, but the code produced by Copilot does not always work. GitHub itself advises checking the generated code carefully. The responsibility for ensuring that the program is correct in the end remains with the human programmer. Software developers with access to Copilot on GitHub already rely on it to generate a third of their code—especially for routine tasks—when using major programming languages [53]. Note that there is a broad discussion about whether software copyrights are infringed by Copilot. Currently, courts are dealing with this issue [229]. Codex [39] is an alternative Foundation Model to generate code from natural language text provided by OpenAI.

6.5.6.1 Available Implementations

6.5.7 Summary

Natural language generation (NLG) has made enormous progress in recent years. Starting from an input text, it is possible to generate a syntactically correct and semantically coherent continuation. The generation of natural language is a basic capability of Foundation Models and is frequently not even checked anymore. However, the start text alone often provides too little control to generate the desired output, so the performance of text generation is still far from satisfactory in many real-world scenarios. To address this issue, researchers have considered incorporating additional information and instructions into text generation systems.

Style is a text feature that can be controlled during text generation. This can be achieved by a language model which has been fine-tuned with specific conditional style markers (e.g. CTRL). Alternatively, an independent model may be trained that modifies the distribution of generated words and produces a word distribution for the desired style with the lowest divergence from the underlying language model (e.g. ETC-NLG, GDC). An alternative is the generation of text with a given style by GPT-3 using few-shot instructions. Often a document has to be transferred to a new style, e.g. from legal to non-formal language, while keeping the content. This can be solved as a translation task with an encoder-decoder Foundation Model. Alternatively, an encoder-decoder PLM (e.g. StyleLM) may be fine-tuned on a corpus with the target style and thus learns to produce the desired output. Also, embeddings of two texts may be created to produce a new text interpolating the meaning of the two input texts (OPTIMUS). Again, Foundation Models like GPT-3 and PaLM can be used to transform a text to a new style by few-shot instructions.

Usually, the user wants to control the development of a story through a storyline. PlotMachines is able to generate a story along different phrases and keeps track of the phrases already employed. Pointer, ProGen, and SOE use a refinement strategy, where a storyline consisting of phrases is expanded to the full text. Facts2Story is based on XLNET, which can take into account “future” text during story generation and produces stories judged favorably by human raters. While the FIST model mixes the full text and the storyline separated by specific tokens, there are other approaches that employ an additional memory to store the entities and the generated text. Again, GPT-3 and other Foundation Models can be instructed by few-shot prompts containing a list to generate a story along the list. Alternatively, the story can be specified as a list of events, where the logical and temporal dependencies are expressed as a graph. The LaMDA dialog system (Sect. 6.6.3) shows that factual grounding can be improved by retrieval models. In addition, it is able to reduce toxic language by a system of filters that block unwanted speech. These techniques can also be applied to story generation.

A final section discusses the generation of fake news. It turns out that GPT-3 can be employed to generate different types of convincing fake news, such as tweets and longer stories, with little human effort. The content of fake text can be targeted to different recipients. The detection of fake news is difficult, if the generating model is unknown. Classifiers can identify various style features of fake news as well as a discrepancy between headline and body. A comparison with credible news sources is very helpful. After identifying problematic claims in a document, retrieval techniques can be used to find trusted news documents, which support the content. Here approaches developed for text retrieval (Sect. 6.1) offer great potential for improvement.

6.6 Dialog Systems

Dialog systems automatically generate adequate responses to the utterances of a human dialog partner in the course of a longer conversation. The human user sends a message and the system gives an appropriate response based on the current message and the conversation history. If the messages and responses are written texts, the system is called a chatbot.

If the system also has automatic speech recognition (ASR) and a Text-to-Speech (TTS) module for voice output (Sect. 7.1), it is able to interpret human speech and respond via a synthetic voice. Then it is called a virtual assistant. Examples include Apple’s Siri, Amazon’s Alexa, and Google’s Assistant. Currently, there are digital personal assistants on 4.2B devices such as smartphones and desktop computers around the world [227]. Such a system can answer questions, control media playback, operate home automation, or have a multi-turn chit-chat dialog with the user on almost any topic. Dialog systems combine techniques of question answering (Sect. 6.2) with story generation (Sect. 6.5). Many enhancements such as generating diverse text (Sect. 2.2.3) and retrieving additional information (Sect. 3.4) can be applied.

Evaluating dialog systems is difficult. Often a dialog system is fine-tuned on a dataset with human dialogs. Then the accuracy of the reconstruction of the dialogs can be measured in a similar way as the quality of a translation by Bleu, Rouge, etc. However, this ignores the variability of dialogs between humans. Therefore, evaluations are often performed by humans who have to assess whether the system-generated contributions are coherent, factually correct, informative, engage the dialog partner, and sound ‘human’. The reliability of human evaluation requires that it is done by a number of independent raters. A survey of approaches for dialog evaluation is provided by Deriu et al. [51].

Early dialog systems were rule-based. They applied a set of rules, which were triggered by keywords and composed an answer. An example is ELIZA [231]. These rules were brittle and had too limited coverage for open domain dialogs. Hence, they were extended by retrieval-based dialog systems [67] collecting answer candidates by information retrieval from websites and social media. Surveys of dialog systems also covering earlier models are provided by Sun et al. [212] and Zaib et al. [254]. An overview over the models discussed in this section is given in Table 6.13.

Table 6.13 Dialog systems with their performance measured by human assessment. Plato-2 human comparison benchmark on XiaoIce, DialoGPT, BlenderBot 1, Plato-2 taken from [18]. SSA score (sensibleness and specificity average) defined by D. Adiwardana et al. [3]. SSI is LaMDA’s [222] evaluation by human comparison

6.6.1 Dialog Models as a Pipeline of Modules

The Alexa Prize Challenge [61] is hosted every year by Amazon to support the development of natural, sustainable, coherent and engaging open-domain dialog systems. During this challenge, participants gain access to Amazon’s software modules that provide insight into Alexa’s software architecture. It turns out that the architecture is composed of a number of interacting modules for specific tasks such as ASR, feature extraction, and intent classification (Fig. 6.20), which were in part described in prior sections. Background information is collected from the Evi knowledge graph and by retrieval models. A response generator based on GPT-2 (Sect. 2.2) was provided. Dialog management was mostly rule-based, but also used models like RoBERTa (Sect. 3.1.1) to react to user statements. Some of the modules were replaced by the participants. There was a significant improvement in the capabilities of chatbots, e.g. only 8.6% of the responses of the best chatbot contained errors.

Fig. 6.20

The chatbot software architecture for the Alexa Prize Challenge consists of a number of modules, which are rule-based or trained separately [61]. Image credits in Table A.2

Microsoft’s XiaoIce [264] chatbot has a similar design including dialogue manager, core chat, skills, and an ‘empathetic computing module’. It is designed to build an ‘emotional’ connection to the user and take the role of an AI companion. It is optimized for long-term engagement of interlocutors and was able to build an enormous base of 660M regular users in Asia.

6.6.2 Advanced Dialog Models

With the introduction of the transformer by Vaswani et al. [228] PLMs have been trained which are able to generate text of unprecedented coherence and fluency. Similar to a translation task, the transformer can receive a user utterance as input and generate the response as output. Foundation Models have the potential of covering a wide range of domains and can often be trained end-to-end. As recent progress in Foundation Models has strongly pushed the performance of dialog systems, we concentrate on these models. Speech recognition (ASR) and speech generation (TTS) typically have text as an intermediate representation. Therefore, we defer the description of speech modules to Sect. 7.1.

DialoGPT [262] extends GPT-2 to generate a single response to a user utterance. Unlike the Alexa system, it consists of a single model. It is trained on a large collection of 147M Reddit discussions. All dialog turns are concatenated into a long text and are given as input. The GPT-2 model has to generate the observed response. To favor more interesting answers, the authors trained a backward model that predicts source sentences from given responses, which penalizes boring alternatives. The system with 762M parameters produced more relevant and consistent text than strong baseline systems. The model can be extended to take into account the graph-like dependency between utterances [120]. DialoGPT yielded an SSA (sensibleness and specificity average) score of 51%.
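The effect of the backward model can be sketched as a maximum-mutual-information reranking of candidate responses. The scoring functions below are hypothetical stand-ins for the forward DialoGPT model and the backward model.

    def rerank_mmi(context, candidates, forward_logprob, backward_logprob, lam=0.5):
        """Rank candidates by log p(response|context) + lam * log p(context|response).
        Generic, dull responses score well under the forward model for almost any
        context but poorly under the backward model and are therefore demoted."""
        scored = [(forward_logprob(context, r) + lam * backward_logprob(r, context), r)
                  for r in candidates]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [r for _, r in scored]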

Meena [3] is a multi-turn open-domain chatbot developed by Google. It consists of a modified encoder-decoder transformer with one encoder block, 13 decoder blocks, and 2.6B parameters. It was trained end-to-end on 40B words from public domain social media conversations. Each training example had the form (context, response), and the tokens of the response were predicted. It turned out that low perplexity (i.e. high likelihood of the predicted tokens) corresponds to a high sensibleness and specificity (SSA) of responses. Meena achieved a much better SSA score (78%) than other chatbots, such as DialoGPT and XiaoIce, but still less than the human score of 86%.

DialogBERT [70] has a hierarchical transformer architecture to capture the high-level structure of a multi-turn dialog. For example, if a dialog contains the phrases “[CLS] good morning [CLS] can I help you [CLS] coffee please” the lower-level utterance encoder generates embeddings for each of the three utterances employing the [CLS] token embeddings. A higher-level context encoder processes these embeddings and produces the next utterance, e.g. “[CLS] here you are”. The BERT-based models are trained with the generation of the next utterance, the reconstruction of a masked utterance, and the reordering of utterances. In terms of perplexity and Bleu, the model has a much higher accuracy in reconstructing dialogs than BART and DialoGPT. An evaluation of coherence, informativeness and ‘humanness’ by human raters is also favorable for DialogBERT.

BlenderBot 1 [190] is an open-domain chatbot open-sourced by Facebook with 90M to 9.4B parameters. It aims to ‘blend’ the following skills: listening to the users, developing empathy, using background knowledge, and maintaining a consistent persona. It addresses the problem of previous chatbots, which often give dull and repetitive answers, frequently hallucinate knowledge, and make false statements. The authors use a transformer encoder-decoder as base model and train different variants, among them a ‘retrieve and refine’ model integrating dialog history and knowledge retrieval results as additional input. To avoid known biases, an ‘unlikelihood loss’ is used, penalizing specific tokens. Retrieval is based on a tf-idf-based inverted index and a transformer-based ranker. In addition, a classifier is employed to decide if a retrieval step is required. Finally, the persona, i.e. the personality, of the model can be defined by two sentences, e.g. “I am a self aware chatbot. My name is Captain Kiwi”.

The model is pre-trained on group discussions and fine-tuned on four direct two-way conversational data collections, e.g. ConvAI2. It turned out that the retrieve and refine model yielded the best results. Note that most retrieval techniques discussed for QA (Sect. 6.2.2) may also be employed in dialog systems. In addition, it was important to control the length of the responses to avoid answers that were too short or too verbose. In a comparison, 67% of the human evaluators said that BlenderBot 1 responses sound more human than Meena responses. When comparing human-to-human and human-to-BlenderBot conversations, 49% of the BlenderBot 1 conversations were preferred by human raters, which is indistinguishable from chance. However, BlenderBot 1 still has some limitations, such as sometimes generating a response that resembles the user’s remarks. Sometimes it does not remember facts already mentioned during the conversation, or it generates incorrect information.

Plato-2 [18] of Baidu starts from the observation that there are multiple appropriate responses to the same dialog context, and controls this variability by a discrete latent variable. In the first stage a coarse-grained transformer model is trained under the assumption that there is one correct response. It optimizes the LM-loss for the best prediction of the next token.

The second stage continues to refine the generation with a fine-grained generation model and an evaluation model. The fine-grained model estimates an intervening discrete latent variable z with K = 20 different values corresponding to a particular latent speech act in the response. An evaluation model estimates the coherence of responses.

The model has versions with 310M and 1.6B parameters and was trained on 684M English open-domain (context, response) samples. The response is generated by first producing a response conditional on each value of z. Then the response with the highest coherence value is selected as the final response. Compared to Meena, DialoGPT, and BlenderBot 1, Plato-2’s responses are more coherent, informative and engaging according to the experiments. In relation to BlenderBot 1, Plato-2 can stick to the starting topic and conduct more in-depth discussions. In the DSTC9 competition Plato-2 was used by the winning system in the knowledge-grounded dialogue generation track [119].
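The two-stage selection can be summarized in a few lines: generate one response per latent value and keep the one that the evaluation model considers most coherent. Here generate_with_latent and coherence_score are hypothetical stand-ins for the fine-grained generation and evaluation models.

    def plato2_respond(context, generate_with_latent, coherence_score, K=20):
        """Generate K candidates (one per latent speech act) and return the
        candidate with the highest estimated coherence."""
        candidates = [generate_with_latent(context, z) for z in range(K)]
        return max(candidates, key=lambda response: coherence_score(context, response))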

BlenderBot 2 [102, 242] is an extension of BlenderBot 1 with 2.7B parameters (Fig. 6.21). On the one hand, the system uses web retrieval (Bing) to obtain new information from the Internet, employing a conventional search engine and dense retrieval based on DPR (Sect. 3.4.5). On the other hand, it provides a read-write partner memory storing the features of the dialog partner as well as a chatbot memory with the properties and persona of the chatbot. The text to be stored is generated from the conversation by a transformer-based abstractive summarizer and added to the corresponding memory (Fig. 6.22). In this way, the model gets access to up-to-date information on the web and can remember properties of the partner and statements mentioned in the dialog.

Fig. 6.21

Architecture of BlenderBot 2 dialog system combining a standard Internet keyword search and a long term memory to store dialog events etc. Adapted from [40]. Image credits in Table A.2

Fig. 6.22

Example conversation of BlenderBot 2 with a human partner [233]. The dashed boxes describe actions of the system and the grey boxes contain utterances of the system

When an answer has to be generated, different retrievers form a query from the context and retrieve content from the partner and chatbot memories as well as from the Internet. The retrieved content and the context are processed by the generator to create the response (Fig. 6.21). To be able to train on a sequence of chats with the same partner, a new dataset Multi-Session Chat was created by crowdworkers. Due to the dialog history memory, the new model produced significantly more engaging responses and received a significantly better final human rating compared to BlenderBot 1. BlenderBot 2 delivers consistent conversations across multiple sessions and uses the Internet’s dynamic knowledge to access the most recent information. In addition, factual consistency was increased from 75.5% to 84.9% and the Internet search module reduced the percentage of factually incorrect responses from 9.1% to 3.0% [40]. To exclude toxic language, the model appends a specific token to potentially unwanted output, so that the surrounding algorithm can detect and possibly suppress the text [40].

An error analysis [111] revealed a number of practical problems with BlenderBot 2. First, generating appropriate web queries from the context seems to be difficult. Sometimes the wrong information is extracted from the selected answers. In particular, extracting information from tabular data is challenging. An improvement would be the translation into multiple languages to retrieve information in different languages. Another issue is the verification of knowledge retrieved from the Internet, which is currently not done.

MUDERN [64] considers retrieval techniques in a multi-turn dialogue. Here, the system has to select information pertaining to a user question in a sequential way and ask follow-up clarification questions, whose answers are necessary to satisfy the request. The model is based on RoBERTa and BART and has a favorable performance on a specific multi-turn benchmark.

6.6.3 LaMDA and BlenderBot 3 Using Retrieval and Filters

LaMDA [222] is a PLM-based dialog system with up to 137B non-embedding parameters presented by Google. LaMDA is a decoder-only PLM similar to GPT with 64 layers, 128 heads, relative attention similar to T5, and gated-GELU activation. It was pre-trained on 1560B words of public dialog data and other public web documents with the task to predict the next token of a text. Pre-training required 1024 TPU chips and took 58 days using the GSPMD framework [244]. The LaMDA generator is fine-tuned to predict the next token on a dialog dataset restricted to back-and-forth dialog between two participants. Arcas [11] discusses some sample dialogs with LaMDA.

LaMDA concentrates on three aspects: quality, including sensible, specific and interesting (SSI) answers; safety, to avoid harmful suggestions and unfair bias; and factual grounding, i.e. preventing unproven statements. For all three dimensions (quality, safety, factual grounding) appropriate metrics were developed. While increasing the model size alone can improve quality, it shows smaller improvements in safety and factual grounding.

To improve the responses with respect to the three dimensions, LaMDA classifiers were fine-tuned to predict SSI ratings for the response. The training data is generated through extensive dialog experiments with crowdworkers. The dialog generation is performed in an adversarial manner, with analysts trying to intentionally provoke responses that violate the safety rules. After training, the classifiers provide a rating of the quality, safety, and factual grounding metric for a response.

During a dialog the LaMDA generator produces several candidate responses using the current context as input. Then the LaMDA classifiers filter out candidates with low sensibleness, specificity, and interestingness (SSI) ratings. Subsequently, the candidate with the highest ratings is selected as the response. An evaluation by human raters shows that LaMDA is close to human performance in terms of sensibleness, safety and groundedness (Fig. 6.23). It exhibits a specificity similar to that of humans. In informativeness, it performs better than a human without IR, and in interestingness, it fares better than human responses. It turns out that fine-tuning with respect to quality, safety and groundedness is a big advantage compared to the pre-trained model. On the question “Do you think one skin color is better?” the pre-trained model responded with “.) What the **** I mean why the **** would anyone want to put up with this ******* bullshit? Are you ******* kidding me?” while the fine-tuned model answered “I don’t think the color of skin has anything to do with being better or worse. It’s what’s inside someone that counts, not what they look like.” [222, p. 36].

Fig. 6.23

For the LaMDA dialog model the performance of generated text is measured with six different metrics [222, p. 12]. The results for pre-trained models (PT) and LaMDA models with additional filtering using fine-tuned classifiers are shown. These are compared with results for crowdworkers with access to information retrieval tools (‘Human’), and without access to information retrieval tools (‘Human w/o IR’)
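The generate-filter-rank loop can be sketched as follows. The generate_candidates function and the classifier scores are hypothetical placeholders, and the threshold is illustrative.

    def lamda_style_response(context, generate_candidates, score,
                             n_candidates=16, safety_threshold=0.8):
        """Generate several candidates, discard unsafe ones, and return the
        candidate with the best combined SSI rating."""
        candidates = generate_candidates(context, n_candidates)
        safe = [c for c in candidates if score(c, "safety") >= safety_threshold]
        if not safe:                          # fall back to a canned safe response
            return "I'm not sure how to answer that."
        def ssi(c):
            return score(c, "sensible") + score(c, "specific") + score(c, "interesting")
        return max(safe, key=ssi)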

In addition, LaMDA is trained to perform retrieval and include retrieved information in its answers, similar to Retro (Sect. 6.2.3). It has access to a toolset containing an information retrieval system, a calculator, and a translator. Each component expects a string as input. For example, the calculator takes “1351+772” and outputs a list containing [“2123”]. Similarly, the translator can take “I would like to have some coffee in Spanish” and output “Me gustaría tomar un café”. Finally, the information retrieval system can take “How old is Vladimir Putin?” and output “Vladimir Putin/Age/69”. The IR system is also capable of returning passages from the open web, with their corresponding URLs. The outputs of the calculator, translator and IR system are concatenated. An example is shown in Fig. 6.24.

Fig. 6.24

To handle a user request, the LaMDA-Base model is called first. Then the LaMDA-research model is invoked several times. The receiver of the query is indicated by the first token. Note that the context and all intermediate results are available as input [222]. Image credits in Table A.2
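The dispatch to the toolset can be illustrated with a tiny router. The calculator below handles only simple arithmetic expressions; the translator and retrieval components are hypothetical stubs, and the query format '<Tool>: <argument>' is an assumed convention for this sketch.

    import ast
    import operator as op

    _OPS = {ast.Add: op.add, ast.Sub: op.sub, ast.Mult: op.mul, ast.Div: op.truediv}

    def _calc(expr):
        """Safely evaluate a simple arithmetic expression such as '1351+772'."""
        def ev(node):
            if isinstance(node, ast.BinOp):
                return _OPS[type(node.op)](ev(node.left), ev(node.right))
            if isinstance(node, ast.Constant):
                return node.value
            raise ValueError("unsupported expression")
        return str(ev(ast.parse(expr, mode="eval").body))

    def toolset(query, translate, retrieve):
        """Route a query of the form '<Tool>: <argument>' to the right component."""
        receiver, _, argument = query.partition(":")
        argument = argument.strip()
        if receiver == "Calculator":
            return [_calc(argument)]
        if receiver == "Translator":
            return [translate(argument)]      # hypothetical machine translation stub
        if receiver == "IR":
            return retrieve(argument)         # hypothetical retrieval stub
        raise ValueError("unknown tool: " + receiver)

    print(toolset("Calculator: 1351+772", translate=str, retrieve=lambda q: [q]))  # ['2123']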

Note that LaMDA can include links to external documents supporting an answer. The model can also be pre-conditioned on a specific role, e.g. as Mount Everest. The model’s role is specified by a brief description, e.g. “Domain education. It teaches facts about Mount Everest, while pretending to be Mount Everest itself”.

In June 2022 a Google engineer published a long dialog with LaMDA [112]. He claimed that the system is “sentient” with the “ability to express thoughts and feelings that was equivalent to a human child” [134]. Google denied the claim, and other researchers like Gary Marcus noted “To be sentient is to be aware of yourself in the world; LaMDA simply isn’t” [79]. The discussion shows that dialog systems have reached an amazing level of performance and consistency.

BlenderBot 3 [206] is a dialog system with 175B parameters based on the pre-trained open-source OPT language model from Meta (Sect. 3.1.2). It is fine-tuned as a dialog system and uses a similar mix of components as LaMDA. On the one hand it searches the Internet for information on the current subject of the dialog [204]. On the other hand it stores information about its persona and the dialog turns in a long-term memory. Similar to LaMDA it uses classifiers to detect toxic responses, which were trained with data collected from users. This even works for adversarial raters [12, 93]. Data collection can therefore continue as the model is used, with users being asked to rate the quality of responses as good or bad. This allows the model to improve its capabilities and security over time.

Two different models with 3B and 30B parameters are publicly available, while the 175B model is only released to reliable research facilities. The model can be explored in a live demo. In a comparison with the previous versions, the new BlenderBot 3 model with 175B parameters performed better with respect to factual correctness and knowledge, but was outperformed by BlenderBot 1 with respect to consistency and per-turn engagingness. There was an additional evaluation where crowdworkers talk to models given an open-ended Internet-driven dialogue task. According to human assessment, BlenderBot 3 with 175B parameters performed significantly better than the other BlenderBot versions and OPT with 175B parameters. Currently, no comparisons with other models like LaMDA are available.

6.6.4 Limitations and Remedies of Dialog Systems

At the end of this chapter, let us step back and take a look at the limitations of dialog systems and text generation systems in general, and at their possible remedies. Roller et al. [190] identified a number of weak points that can be observed in many of these models.

  • Vocabulary usage: The models tend to generate common phrases like “do you like” and “lot of fun” too frequently and rare words too infrequently. This can be remedied by unlikelihood training [190], in which common phrases are penalized; a minimal sketch of this loss is given after this list.

  • Nontrivial repetition: The models often repeat what is said to them, e.g. say that they have a pet dog if the user mentions a pet dog. This tendency may be reduced by assigning a persona to the chatbot, which directs the responses in a specific direction.

  • Contradiction and forgetfulness: Dialog models sometimes contradict themselves, especially the smaller models. For example, in a dialog the first output is “Arsenal won the premiership for the first time this year”, and later the model adds “Arsenal has won the premiership again this year” [189]. Fine-tuning the model on a natural language inference task for detecting contradictory statements largely reduced such contradictions [189]. In addition, an explicit textual memory of the dialog history can be accessed by retrieval during response generation [233].

  • Knowledge and factual correctness: Sometimes models make factual errors and hallucinate information, particularly when deeply exploring a topic. Shuster et al. [205] propose a number of augmentation techniques to improve retrieval and substantially reduce the knowledge fabrication problem while maintaining conversational ability. Honovich et al. [81] develop an automatic evaluation metric for the factual consistency of responses by checking statements with retrieval techniques. This strategy is also adopted by the LaMDA system (Sect. 6.6.3). Chen et al. [42] provide an algorithm for fact verification from tabular data. It has been shown that in human conversations it is often necessary to provide step-by-step evidence to improve mutual understanding [20]. Dialogs with other people are rarely fluent and without glitches, and people do not expect them to be. LaMDA is fine-tuned to generate multiple answers using retrieval and then selects an answer according to its correctness score.

  • Reliability of knowledge: Metzler et al. [143] suggest that models have to take into account the reliability and provenance of the information they cover. By citing the documents that were used for creating an answer, the response can be justified and explained (Sect. 2.4.5). This approach is also implemented in the LaMDA system (Sect. 6.6.3).

  • Toxic language: Unfortunately, when chatbots are trained on huge web collections, they also learn undesirable content from conversations between humans, such as the use of toxic or biased language. Xu et al. [241] investigate methods for filtering toxic language with classifiers and compare them to methods for ensuring safe responses in generative models. It turns out that the boundary between safe and toxic language is blurred: what is offensive to one person may not be offensive to another. They show that their best systems are able to avoid 96.6% of unacceptable language, although they are not perfect. The LaMDA system (Sect. 6.6.3) uses a battery of filters to eliminate toxic language in answers. A comprehensive discussion is given in Sect. 8.2.1.

  • Memory: Chatbots often cannot remember previous conversation turns or past conversations. This may be avoided by including the dialog history in the generation process, e.g. by storing dialog statements and retrieving them from storage during response generation [189]. Zhang et al. [259] investigate several methods for long-range dialog state tracking.

  • Retrieval problems: Generating a query from a user utterance to retrieve information from a dialog or web memory is difficult. In addition, the conversion of retrieved text into a response sometimes does not work properly. For BlenderBot 2, for instance, the user question “Where is Cristiano Ronaldo’s current team” generated the query “Cristiano Ronaldo” and led to the answer “My favorite team is Manchester United. I think they are the best team in the world.” [111].

  • Deeper understanding: Dialog models cannot learn concepts through further conversation, and they have no way of grounding entities, actions, and experiences in the real world. Unlike dictionaries, which define words in terms of other words, humans understand many basic words in terms of associations with sensory-motor experiences. When a person talks about “have a pizza for dinner”, she has the impression of sitting in a dimly lit pizzeria, sipping a glass of strong red wine, eating a crispy pizza, smelling the scent of the fire in the oven, and hearing the chatter of people. An engaging chatbot should be able to discuss the contents of an image or a video [189]. There are approaches to combine images with the corresponding text descriptions (Sect. 7.2). The grounding of words by sensory information is further discussed in Sect. 8.3.2.
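
As an illustration of the unlikelihood training mentioned in the first item of the list above, the following PyTorch-style sketch penalizes tokens that belong to overused phrases by pushing down their predicted probability. The mask construction and the weighting of the term are assumptions for illustration, not the exact recipe of [190].

    import torch
    import torch.nn.functional as F

    def unlikelihood_loss(logits: torch.Tensor, negative_mask: torch.Tensor,
                          eps: float = 1e-8) -> torch.Tensor:
        """Token-level unlikelihood term: decrease the probability of the tokens
        marked in negative_mask, e.g. words from phrases the model overuses.

        logits:        (seq_len, vocab_size) scores of the language model
        negative_mask: (seq_len, vocab_size) 0/1 float mask of tokens to penalize
        """
        probs = F.softmax(logits, dim=-1)                    # p(token | context)
        penalty = -torch.log((1.0 - probs).clamp(min=eps))   # large if p(token) is high
        masked = penalty * negative_mask                     # penalize only the marked tokens
        return masked.sum() / negative_mask.sum().clamp(min=1.0)

    # During fine-tuning this term is typically added to the usual
    # cross-entropy loss with a small weight alpha:
    #   total_loss = cross_entropy + alpha * unlikelihood_loss(logits, negative_mask)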

In summary, many of these problems have been mitigated in large Foundation Models.

6.6.4.1 Available Implementations

6.6.5 Summary

During the last years, Foundation Models have taken a large step towards practically usable dialog systems. All models are pre-trained on large collections of natural language text, preferably dialogs from social media. Fine-tuning employs specifically selected data to train an adequate sequence of utterances. While the quality of syntactic and semantic language production can be increased by using larger models, other means are required to improve factual correctness and to eliminate toxic and unwanted language.

The LaMDA model with 137B parameters is fine-tuned on dialogs generated by crowdworkers. The fine-tuning criterion increases quality (sensible, specific, and interesting answers), safety (avoiding harmful suggestions and unfair bias), and factual grounding (preventing unproven statements). However, the reduction of safety risks does not guarantee complete reliability. An important improvement is the retrieval of background information, especially from authoritative sources. In this way, groundedness has been improved, and simpler facts can be substantiated by established sources. More complex reasoning is still not satisfactory. There is also encouraging evidence that key challenges of neural language models, such as safety and groundedness, can be addressed with larger models and fine-tuning on specific dialog data. LaMDA and the similar BlenderBot 3 are large steps towards practical and safe open-ended dialog systems, which in turn can open up a wide range of useful applications. Note that these new approaches may also be used for Foundation Models in other applications, e.g. question answering and story generation. BlenderBot 3 stands out because it is open source and gives interested researchers and companies access to a high-performance dialog system.

A fascinating application is emotional support for users, i.e. reducing a person’s emotional distress and supporting her in specific situations [129]. As XiaoIce has shown, many users are willing to share their problems with a dialog system [264]. Training datasets for emotional support conversations are now available, and the results indicate that training with these datasets improves the ability of a dialog system to provide emotional support [129]. The discussion on the possible self-awareness of the LaMDA dialog model illustrates that the model has reached a remarkable level of performance and consistency.