
Foundation Models [13] are concerned with the interpretation of sequences of different types. They evolved from Pre-trained Language Models (PLMs), which model the joint distribution of the discrete tokens of written language. For these tokens, embeddings are derived in different layers by self-attention and can flexibly and deeply characterize the meaning of each token in its context. These token embeddings can subsequently be used for downstream tasks.

Sequences can also be patches of images, sound bites in audio recordings, 3D tubelets in videos, events in game trajectories, etc. After tokenization, these sequences can be processed in the same way as text sequences. When different media types are ingested together, e.g. an image and the corresponding textual description, the relationship between words and visual content is automatically acquired from the data. It seems that most aspects of our world can be represented as sequences. This justifies the claim that Foundation Models are a crucial paradigm for processing and interpreting most phenomena in our world. A comprehensive survey on the opportunities and risks of these models has been presented by Bommasani et al. [13].

In the next section, we summarize Foundation Models, their main properties, and areas of application. In addition, promising economic solutions are outlined. The second section describes social and ethical aspects of these systems, including possible discrimination, misinformation, and malicious uses. The final section discusses whether there are dimensions of intelligence not currently covered by Foundation Models.

8.1 Foundation Models Are a New Paradigm

This section recaps the key characteristics of Pre-trained Language Models and their larger successors, Foundation Models. We summarize their performance in the applications covered in this book, and the benefits of the economic solutions they offer.

8.1.1 Pre-trained Language Models

Pre-trained Language Models have been developed in three flavors: the Transformer encoder-decoder by Vaswani et al. [89], autoencoders like BERT by Devlin et al. [31], and autoregressive language models like GPT-2 by Radford et al. [70]. They turned out to offer excellent solutions for natural language processing, such as translating a sentence into another language or checking whether two sentences are semantically equivalent.

Usually, these models were created in a two-step procedure. In the first step, the model was pre-trained on a non-specific big collection of natural language documents to acquire general knowledge about the language. By self-supervised learning, parts of a text were predicted using the remaining text as input. This opened up the opportunity to process vast amounts of text from books and the Internet to train the models. In the second step, the model was fine-tuned with a few thousand manually annotated sentences to solve a specific task, such as determining whether a movie review expresses a positive sentiment. The approach worked extremely well, showing that the models have the capability to detect subtle semantic properties of language. This two-step procedure was called transfer learning. After extensive experimentation, it was found that these models worked better the bigger they became and the more data their training sets contained.

Knowledge in PLMs is stored by a huge number of parameters. Parameters contain the recipe to compute embeddings for the input tokens of the models. Embeddings are long vectors of real numbers and provide a way to represent the knowledge associated with the tokens. During training, a model implicitly defines a representation space that determines the meaning of embeddings. Usually, embeddings are assigned to tokens, i.e. parts of words, but may also be determined for paragraphs and complete documents. If two embeddings have a small vector distance, the meaning of the underlying tokens is similar. Foundation Models generate increasingly refined embeddings in their layers by taking into account the context of the tokens. The word “bank” close to the word “money” has a different embedding than a “bank” close to the word “river”, making the embeddings contextual. These effects also apply to tokens of different media types.
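
The contextuality of embeddings can be illustrated with a minimal sketch, assuming the Hugging Face transformers and torch packages and a standard BERT checkpoint; the sentences are invented examples.

```python
# Minimal sketch: the same word receives different contextual embeddings
# depending on its surrounding context (assumes transformers and torch).
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_embedding(sentence: str) -> torch.Tensor:
    """Return the contextual embedding of the token 'bank' in the sentence."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]      # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index("bank")]

e_money  = bank_embedding("she deposited money at the bank")
e_loan   = bank_embedding("the bank approved the loan")
e_river  = bank_embedding("they sat on the bank of the river")

cos = torch.nn.functional.cosine_similarity
print(cos(e_money, e_loan, dim=0))   # same (financial) sense
print(cos(e_money, e_river, dim=0))  # different (river) sense
```

In such a sketch, the similarity between the two financial uses of "bank" is typically markedly higher than the similarity between the financial and the river sense.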

Embeddings are calculated by self-attention, which computes correlations between linear projections of the input embeddings. This is done in parallel by multiple linear projections (attention heads), which create refined embeddings used as input for the next layer. Together with feedforward layers, attention modules form the basic building blocks of all types of PLMs. Despite the investigation of many alternatives, this basic module has proven extremely effective and has remained essentially unchanged in recent years.
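
A minimal NumPy sketch of this mechanism may clarify the computation; the randomly initialized projection matrices stand in for learned parameters.

```python
# Minimal sketch of multi-head self-attention: each head projects the input
# embeddings to queries, keys, and values, computes scaled dot-product
# correlations, and the concatenated head outputs yield refined embeddings.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, Wq, Wk, Wv, Wo, n_heads):
    """X: (seq_len, d_model); Wq/Wk/Wv/Wo: (d_model, d_model) projections."""
    seq_len, d_model = X.shape
    d_head = d_model // n_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                        # linear projections
    split = lambda M: M.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    Qh, Kh, Vh = split(Q), split(K), split(V)               # (n_heads, seq_len, d_head)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)   # token-token correlations
    attn = softmax(scores, axis=-1)                         # attention weights
    heads = attn @ Vh                                       # weighted value vectors
    out = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return out @ Wo                                         # refined embeddings

rng = np.random.default_rng(0)
d_model, n_heads, seq_len = 64, 8, 10
X = rng.normal(size=(seq_len, d_model))
W = [rng.normal(size=(d_model, d_model)) / np.sqrt(d_model) for _ in range(4)]
print(multi_head_self_attention(X, *W, n_heads=n_heads).shape)  # (10, 64)
```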

Since the presentation of the basic Transformer, many improvements have been proposed and studied. Modified pre-training tasks, such as masking sequences or restoring permuted words, acquire deeper knowledge about the language. Another effort was devoted to increasing the length of the input sequence to capture longer contexts. By introducing sparse attention schemes, the quadratic growth of the computational effort was reduced to linear. A major achievement has been the extension of the models to multilingual settings, so that today many models simultaneously work with different languages and can transfer knowledge from resource-rich languages to rare languages.

As the size of these models increased to billions of parameters, and the training data and computational effort increased accordingly, the performance of the models also increased. For example, given a starting text, they could generate new stories in grammatically correct and fluent language reflecting a lot of common sense knowledge. Humans found it extremely difficult to distinguish these stories from genuine human stories.

8.1.2 Jointly Processing Different Modalities by Foundation Models

Large Pre-trained Language Models exhibited an unanticipated “emergent” behavior: without any fine-tuning, the models could be instructed by a prompt to solve a task, e.g. create a story in a specific writing style on a specific topic. The model could additionally be supported by a number of examples (a few-shot prompt), as sketched below. This was a completely new way for a model to solve a task on the fly.
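
The following sketch illustrates how such a few-shot prompt can be assembled as plain text; the reviews, labels, and the `complete` function are invented for illustration and stand for any Foundation Model text-completion interface.

```python
# Illustrative sketch: a few-shot prompt consists of an instruction, a small
# number of worked examples, and the new input to be solved by the model.
examples = [
    ("The plot was predictable and the acting wooden.", "negative"),
    ("A delightful film with a clever script.", "positive"),
]
task = "Review: Absolutely stunning visuals and a moving story.\nSentiment:"

prompt = "Classify the sentiment of each movie review.\n\n"
for review, label in examples:
    prompt += f"Review: {review}\nSentiment: {label}\n\n"
prompt += task

print(prompt)
# completion = complete(prompt)   # hypothetical call to a language model
```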

After building huge models for language, researchers evaluated the same techniques for other types of sequences, including image patches, sound bites in audio recordings, 3D tubelets in videos, DNA subsequences, and event trajectories in video games. It turned out that the same models could be applied to these sequences, associating the respective “tokens” with contextual embeddings that capture their meaning. Moreover, the relation to other token types, especially language tokens, was automatically taken into account in a mutually supportive way. This opened the door to a wide range of mixed media applications, e.g. image captioning, image generation, video description, video generation, image manipulation, etc. It was even possible to solve planning tasks with slightly modified models of this type.

The representation of sequence elements by contextual embeddings determined by self-attention has emerged as an overarching principle for solving a variety of different tasks. In 2021 Bommasani et al. [13, p. 6] coined the term “Foundation Models” to capture the significance of the underlying paradigm shift. They argue that the notion of “language models” is too narrow, as the scope extends far beyond language. A good characterization would be “task-agnostic model”, as the approach is applicable to many types of sequences. “Foundation Model” is similar, since it emphasizes the common basis for many task-specific adaptations. It also suggests the need for architectural stability, safety, and security. Usually Foundation Models have billions of parameters, because, for example, adequate responses to prompts emerge only in models of this size.

Figure 8.1 shows possible training data and application tasks of Foundation Models. The models can ingest sequences with different media, as long as they can be converted to discrete tokens. This covers language and various media, but also structured data and the trajectories of control variables. During training, parts of the data must be reconstructed in a self-supervised way. Advanced Foundation Models have access to a search engine that can retrieve up-to-date information for the currently processed content. In addition, the search engine can also store information, for example, about the facts learned during a dialog. In applications, the Foundation Model can be fine-tuned for specific tasks, or it can be directed with few-shot learning to execute instructions. If it was trained with multiple media, it can translate between these media, for example generating an image according to a caption.

Fig. 8.1

A Foundation Model can integrate the information contained in the data from various modalities during pre-training. It can access up-to-date knowledge by search engines and store intermediate results. This single model can then be adapted to a wide range of downstream tasks by few-shot prompts or fine-tuning [13, p. 6]. Credits for image parts in Table A.1

According to Bommasani et al. [13, p. 3], we can observe four main generations of AI models:

  • In expert systems of the 1980s, the solution of a task was programmed in detail, often in the form of rules.

  • Machine Learning models automatically learn how to solve the task by training with observed data.

  • Deep Learning models no longer need feature engineering, but can be trained directly on raw inputs, such as pixel values. Words were represented by embedding vectors that were automatically derived.

  • Foundation Models can simultaneously process different media and other sequence types, and can be instructed on the fly to solve a specific task.

It is most intriguing that Foundation Models may directly be applied to sensory input from our world, e.g. a video describing an event, and simultaneously to the symbolic description of the world, e.g. by text or by spoken language. In this way both aspects are integrated. According to Fei-Fei Li, a professor at Stanford University, Foundation Models represent a “phase change in AI” [33].

8.1.3 Performance Level of Foundation Models

In the second part of the book, we considered different types of NLP tasks and gave an overview on the performance of current models. This is summarized in the next sections. Note, however, that according to Bengio et al. [9], usually “the performance of today’s best AI systems tends to take a hit when they go from the lab to the field.”

8.1.3.1 Capturing Knowledge Covered by Large Text Collections

The main task of autoregressive language models is the reliable generation of the next word in a text. This has to obey grammatical correctness as well as semantic consistency. The LAMBADA benchmark [66] is a good test to demonstrate this ability (Sect. 4.1.3). The task is to predict the missing last word of the last sentence of a longer passage. Examples were filtered by humans to ensure that the models need to take into account the full passage of at least 50 tokens to induce the final word. With few-shot instructions, PaLM with 540B parameters could increase the accuracy to 89.7% [24, p. 79]. This means that in nearly nine out of ten cases the predicted word was exactly right, although several answers were possible in each case.

During pre-training, Foundation Models are able to extract an enormous body of knowledge from huge text collections. While the early models were tested with a few natural language understanding benchmarks, e.g. GLUE and SuperGLUE (Sect. 4.1.1), current models with hundreds of billions of parameters are usually tested with test collections containing hundreds of different benchmarks. An example is the BIG-bench benchmark (Sect. 4.1.4) with currently more than 200 benchmarks from diverse fields such as analogical reasoning, common sense knowledge, emotional intelligence, ethics, fact checking, humanities, logical reasoning, maths, medicine, science, technology, and social sciences.

The PaLM model with 540B parameters, for instance, achieves with 5-shot prompts a higher BIG-bench score than the average score of the humans asked to solve the same tasks (Sect. 3.1.2). A significant number of tasks showed discontinuous improvements from model scale, meaning that the performance improvement from the smaller PaLM versions to the largest model was higher than expected. Other models, such as GPT-3 and Gopher, achieve lower, but still very respectable results.

Sometimes, however, generated texts or answers to questions are not factually correct, but only somehow plausible. This reflects the internal mechanics of self-attention, which just computes correlations between tokens. Recent models such as WebGPT, Retro, and LaMDA perform a database or web query on the current topic and are able to incorporate information from retrieved documents into the generated text (Sect. 3.4.5). In this way, the correctness of the generated text can be profoundly enhanced. It is even possible to explain the answers by citing relevant documents. Especially helpful for multistep reasoning is the provision of a ‘chain of thought’ that encourages the Foundation Model to break the task down into smaller steps, as sketched below.
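
An illustrative sketch of such a chain-of-thought prompt follows; the word problems and the `complete` call are invented for illustration.

```python
# Illustrative sketch: the few-shot example spells out intermediate reasoning
# steps, encouraging the model to produce similar steps for the new question.
cot_example = (
    "Q: A library has 5 shelves with 12 books each. 9 books are on loan. "
    "How many books are in the library?\n"
    "A: The shelves hold 5 * 12 = 60 books. 9 are on loan, so 60 - 9 = 51 "
    "books are in the library. The answer is 51.\n\n"
)
question = ("Q: A train travels 60 km per hour for 2 hours and then 40 km more. "
            "How far does it travel in total?\nA:")

prompt = cot_example + question
print(prompt)
# answer = complete(prompt)  # hypothetical language model call; the reply
#                            # should now contain intermediate reasoning steps
```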

The verification of the knowledge of Foundation Models has to be performed carefully. Often the model is able to draw a conclusion not from actually ‘understanding’ the situation but from mere correlations (Sect. 4.3). This has to be taken into account during the construction of the tasks. In addition, it has to be guaranteed that no test material was used during pre-training.

8.1.3.2 Information Extraction

Information extraction was the classical approach of natural language processing for solving a task. Text classification, named entity recognition, entity linking, and relation extraction can all be solved with much higher accuracy than before by specialized PLM variants like XLNet or DeBERTa, with accuracy levels usually above 90%. Even for the notoriously difficult task of word sense disambiguation, accuracy could be increased to 83%.

For relation extraction tasks such as aspect-based sentiment analysis or semantic role labeling, the first step is usually to extract one argument of a possible relation. Subsequently models like BART have to decide in a second step whether there is a relation to a second argument. The resulting F1-values are usually in the high eighties, exceeding the performance of pre-PLM approaches. Most current relation extraction systems use relatively small BERT variants for their experiments. Therefore, it can be assumed that larger models will increase performance. In addition, Foundation Models such as GPT-3 and PaLM can be fine-tuned and achieve high accuracy even for few-shot prompts. However, relation extraction has not yet been evaluated with the current text collections (e.g. Big-bench) for Foundation Models.

8.1.3.3 Text Processing and Text Generation

Foundation Models have taken shape most strongly in natural language processing. A surprising breakthrough in this field occurred in Information Retrieval, where embedding-based approaches achieved better retrieval results than prior keyword-based approaches (Sect. 6.1.5). They are able to identify paraphrases and take synonyms into account. This, for instance, has been demonstrated for the MS-MARCO passage retrieval benchmark. In addition, efficient approximate nearest-neighbor search indices like FAISS may be used to accelerate retrieval, as sketched below. These techniques are now employed in production search engines, e.g. by Google.
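
A minimal sketch of such embedding-based retrieval with FAISS, assuming the faiss and numpy packages; the random vectors stand in for the output of a trained text encoder.

```python
# Minimal sketch of dense retrieval: document embeddings are indexed and the
# query embedding is matched against them by inner product (cosine similarity).
import numpy as np
import faiss

d = 768                                       # embedding dimension
rng = np.random.default_rng(0)
doc_emb = rng.normal(size=(10_000, d)).astype("float32")
faiss.normalize_L2(doc_emb)                   # cosine similarity via inner product

index = faiss.IndexFlatIP(d)                  # exact index; FAISS also offers
index.add(doc_emb)                            # approximate ones (IVF, HNSW)

query_emb = rng.normal(size=(1, d)).astype("float32")
faiss.normalize_L2(query_emb)
scores, doc_ids = index.search(query_emb, 5)  # top-5 most similar documents
print(doc_ids, scores)
```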

Question Answering is a classical application in NLP, which has benefited greatly from Foundation Models. Models like GPT-3, PaLM, and LaMDA can be queried by few-shot prompts. With a retriever-reader architecture, additional knowledge can be obtained by search, leading to correct answers much more often. With respect to the Natural Questions benchmark, the FB Hybrid model answers 67.4% of the questions correctly, which is about as good as a human expert using a search engine (Sect. 6.2.2). The LaMDA Foundation Model with 137B parameters demonstrates that facticity can be improved by using retrieval and that a system of filters is able to reduce toxic language.

Translation into another language is a success story of Foundation Models. Usually encoder-decoder models are used to generate a translation. Recent improvements resulted from sentence back-translation, which particularly increases results for low-resource languages, from translating entire documents instead of sentences, and from training a single multilingual model for translation between up to 100 languages. Recently, multilingual models were even able to outperform bilingual translation models for high-resource language pairs. It turns out that, according to human raters, the trained models achieve better performance values than human reference translations for some language pairs (Sect. 6.3.1).

To keep track of a topic in publications, text summarization models are very helpful. Foundation Models can be fine-tuned to condense a long article into a few sentences. Larger documents require a transformer encoder-decoder with a larger input sequence, e.g. BigBird. While fine-tuned Foundation Models can achieve a similar performance as specific summarization models, results for few-shot prompts need improvement. It is possible to fine-tune a model directly with respect to human ratings of summaries. In one experiment, the model’s summaries were preferred to the human reference summaries in 70% of the cases (Sect. 6.4.1).

In story generation, a model receives a start text and generates a syntactically correct and semantically coherent continuation. To have more control over the generated text, a style and the content to be mentioned can be specified. This can be done by including style markers in the start text and specifying a storyline, which can be taken into account by fine-tuned Foundation Models. Much easier is few-shot prompting, where the style and bullet points of the content are provided to a Foundation Model, which incorporates this information during text generation (Sect. 6.5.4). The same techniques can be applied to the creation of computer programs, e.g., through GitHub Copilot (Sect. 6.5.6), but also to the creation of fake news.

Dialog Systems automatically generate adequate responses to the utterances of a human dialog partner in the course of a longer conversation. All models are pre-trained on large collections of natural language text, preferably dialogs from social media. The LaMDA model with 137B parameters (Sect. 6.6.3) is fine-tuned to increase quality (sensible, specific and interesting answers), safety (avoid harmful suggestions and unfair bias) and factual grounding (preventing unproven statements). LaMDA uses retrieval of information to include valid and up-to-date information and is able to incrementally store the state of the dialog in a knowledge base. The discussions on the possible self-awareness of the LaMDA dialog model illustrate that the model has reached a remarkable level of performance and consistency.

If this trend continues, it is possible that in the future only a single Foundation Model will solve a spectrum of text analysis, information retrieval, and text generation tasks. Therefore, any improvements in these background models can lead to immediate benefits across many NLP applications.

8.1.3.4 Multimedia Processing

Speech recognition has made tremendous progress in recent years, and Foundation Models are now an established architecture for this task. Often combined with CNN blocks, they are able to capture interactions over long distances and reduce processing times. On the LibriSpeech benchmark, the SOTA word error rate could be reduced to 1.4% (Sect. 7.1.3). The generation of speech from text has improved dramatically in recent years. WaveNet was the first model to generate speech-like waveforms at 16,000 samples per second. Often models are able to adapt their output to the voices of multiple individual speakers.

Image processing has taken a big leap in the last years. The Vision Transformer (ViT) outperformed CNNs in terms of accuracy on various benchmarks (e.g. ImageNet) and requires much less computational effort. Foundation Models for image processing receive image patches as input (e.g. 16 × 16 pixel squares) and transform them to embeddings, as sketched below. In general, text tokens and image tokens are processed by the same Foundation Model, which makes it possible to generate images from text (DALL-E 2) or to create textual answers for image interpretation tasks. Multitask systems like OFA can generate text and images as output depending on the input query (Sect. 7.2.8).
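
A minimal NumPy sketch of this patch tokenization follows; the 16 × 16 patch size matches the description above, while the image and projection matrix are random stand-ins for real inputs and learned parameters.

```python
# Minimal sketch of ViT-style tokenization: the image is split into 16 x 16
# patches, each patch is flattened, and a linear projection maps it to an
# embedding vector that is fed to the Transformer.
import numpy as np

def image_to_patches(image: np.ndarray, patch: int = 16) -> np.ndarray:
    """image: (H, W, C) -> patches: (num_patches, patch*patch*C)."""
    H, W, C = image.shape
    image = image[: H - H % patch, : W - W % patch]   # crop to a multiple of patch size
    rows, cols = image.shape[0] // patch, image.shape[1] // patch
    return (image.reshape(rows, patch, cols, patch, C)
                 .transpose(0, 2, 1, 3, 4)
                 .reshape(rows * cols, patch * patch * C))

img = np.random.rand(224, 224, 3)
tokens = image_to_patches(img)                 # (196, 768) "image tokens"
W_embed = np.random.randn(tokens.shape[1], 512) * 0.02
embeddings = tokens @ W_embed                  # patch embeddings for the Transformer
print(tokens.shape, embeddings.shape)
```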

Video processing requires the integration of various modalities such as images, video frames, text from video subtitles or speech recognition, and audio together with spoken language. It adds a new time dimension to image processing. Video models often use tubelets as input tokens, which extend image patches over a number of frames. The performance of video interpretation, e.g. for video captioning, has been dramatically improved. The Flamingo model combines a text Foundation Model with video adapters and can solve a large number of video interpretation tasks (Sect. 7.3.3). Nüwa can handle multiple modalities of data and tackles a number of tasks, e.g. text-to-image, sketch-to-image, image completion or editing, text-to-video, video prediction, and video manipulation (Sect. 7.3.4). Imagen Video (Sect. 7.3.4) was recently able to generate short high-definition videos.

Control trajectories are a completely different type of sequence, which can also be processed by Foundation Models. They occur during control tasks, e.g. game playing. The input consists of triples (reward, state, action) at time t, and the aim is to predict the next action. The Decision Transformer uses the forward sum of rewards, which is the sum of all rewards until the end of the trajectory. The model is trained on observed trajectories. By specifying a desired forward sum of rewards, the model generates a sequence of actions which achieves the designated reward level (Sect. 7.4.1). The GATO model demonstrates that a single Foundation Model can be used to solve reinforcement learning tasks together with text and image tasks. It is only a proof of concept and will need to be enhanced in the future.
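
The preparation of such trajectories can be sketched as follows; the reward values are invented, and the resulting triples would be fed to a Decision-Transformer-style model.

```python
# Minimal sketch: for every time step the "forward sum of rewards"
# (return-to-go) is computed, and the model is trained on sequences of
# (return-to-go, state, action) triples.
from typing import List

def returns_to_go(rewards: List[float]) -> List[float]:
    """Sum of all rewards from time step t until the end of the trajectory."""
    rtg, running = [], 0.0
    for r in reversed(rewards):
        running += r
        rtg.append(running)
    return list(reversed(rtg))

rewards = [0.0, 1.0, 0.0, 2.0]
print(returns_to_go(rewards))   # [3.0, 3.0, 2.0, 2.0]
# At inference, a desired return-to-go is specified and the model
# autoregressively generates actions intended to achieve that reward level.
```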

8.1.4 Promising Economic Solutions

The technology behind Foundation Models is now beginning to make the leap from academic research to widespread real-world solutions [88]. Foundation Models can be considered as a general-purpose technology, much like electricity [16], which can be employed in a very wide range of applications and can be expected to generate a host of complementary innovations.

Oren Etzioni, the CEO of the Allen Institute, estimates that more than 80% of AI research is now focused on Foundation Models [33]. Huge sums of money are being poured into AI startups. In 2021, American venture capitalists invested a record $115B in AI companies, according to data provider PitchBook. Wu Dao shows that China is making the field a national priority. We now list a number of important economic applications of Foundation Models.

Search and Retrieval are important Foundation Model applications, as keyword search on the Internet can now be enhanced or replaced by comparing embeddings to retrieve documents indexed according to their meaning. But search for images and videos also seems to be rewarding, as Foundation Models allow the comparison of text, images, and video frames with unified embeddings.

Effective writing is one of the most important skills in our information-based economy. Foundation Models offer comprehensive support for this activity. Starting with some text containing conditions or instructions, these generative models can automatically produce new sentences, paragraphs, or even entire memos that are strikingly coherent, informative, and creative. The text can be simultaneously checked and supplemented with up-to-date information from the Internet. There are already a number of startups developing such tools to support writing [88].

Language translation is a way to overcome language barriers and enable people to understand each other to facilitate cultural exchange and trade. Current Foundation Models are able to train on more than 100 languages simultaneously and provide translations in all directions (Sect. 6.3.2). In this way millions of users speaking low-resource languages can access information and knowledge from around the world. Innovative solutions are possible, such as live translation of telephone conversations and synchronization of videos taking into account the lip movements of the speakers [88].

Chatbots are a way to exchange information with users in real time, e.g. for customer service requests, information about orders, or sales information. This requires systems that comply with privacy and security requirements, avoid toxic language, and integrate with third-party applications. Instead of rule-based systems with many different modules, new systems such as LaMDA (Sect. 6.6.3) are trained on large sets of conversations and provide meaningful, specific, and interesting dialogs, avoid harmful suggestions and unfair biases, and are fact-based by querying data collections of relevant documents. As has been shown for PaLM (Sect. 3.1.2), recent Foundation Models perform better than average humans on a large battery of benchmarks, including common-sense knowledge and question answering. A related startup is Rasa [72], which provides an open-source chatbot with a focus on chatbot configurability. Conversational Voice Assistants combine chatbot technology with speech recognition and speech generation. Prior systems such as Siri and Alexa have been mainly used for non-critical conversations. In 2020, there were 4.2B digital voice assistants in use worldwide [87], and this market had a volume of $340B, with a focus on financial services and e-commerce. There are a number of startups specializing in this field.

Healthcare is a huge market of $4T with many interesting tasks, such as patient screening and care navigation, where chatbots are the digital gatekeepers of the healthcare system. Foundation Models can provide the interface for care providers, collect diagnoses and treatments, and perform the analysis of patient records. Moreover, Foundation Models can interact with patients, answer questions, assist care, and support community health and prevention [13, p. 57]. In addition, there is a huge need for systems that interpret medical imaging results like ultrasound, X-rays, or MRI. Furthermore, Foundation Models can support drug discovery and clinical tests and guide personalized medicine. With a critical shortage of trained therapists, there is an opportunity for mental health chatbots. These systems can be accessed instantly via a mobile app to talk to individuals about their lives and problems. They are not a complete clinical solution, but rather one potentially useful tool for people in need. Woebot [94] is a leading startup in this area.

Foundation Models in genomics and proteomics have an extremely high potential for biomedical research and drug discovery (Sect. 7.5). Deciphering the language of DNA sequences is one of the most important goals of biological research. While the genetic code, which explains how DNA is translated into proteins, is universal, the regulatory code, which determines when and how genes are expressed, varies between different cell types and organisms. This is similar to polysemy and distant semantic relationships in natural language texts. DNABERT [42] has been pre-trained on a large set of DNA sequences and can improve the state of the art by fine-tuning for many specific prediction tasks, e.g. the analysis of biological relevance and the prediction of gene expression. There are a number of startups, such as Quantagene, that are using the human genome for precision medicine.
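
DNABERT represents a DNA sequence as overlapping k-mers that play the role of word tokens; a minimal sketch of this tokenization follows (the sequence is invented).

```python
# Minimal sketch of k-mer tokenization as used by DNABERT-style models: the
# DNA sequence is split into overlapping substrings of length k, which are
# then treated like word tokens by the language model.
def kmer_tokenize(sequence: str, k: int = 6) -> list:
    """Return the overlapping k-mers of a DNA sequence."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

dna = "ATGCGTACGTTAG"
print(kmer_tokenize(dna, k=6))
# ['ATGCGT', 'TGCGTA', 'GCGTAC', 'CGTACG', 'GTACGT', 'TACGTT', 'ACGTTA', 'CGTTAG']
```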

Proteins are linear chains of amino acids and can be represented by an alphabet of 25 characters. The strings are ideally suited for many NLP methods [64]. AminoBERT is a language model [25] which predicts the 3D protein structure from a protein sequence as input. On specific tasks the model even outperforms AlphaFold2 [44]. There are a number of other models with similar results [55]. They could accelerate drug development and lead to a significant reduction in development costs.

The legal industry provides legal goods and services and has a huge application potential for Foundation Models. In the US, there are 1.3M lawyers and more than $300B annual revenues [13, p. 57]. Legal work usually involves reading and summarizing documents, e.g. contracts, rulings of the appeals courts, historical decisions and standards, legal research, etc. Foundation Models may take into account many modalities: audio during trials, video and images during content discovery, and text in conducting legal research. They may weigh legal arguments and support lawyers, judges, and prosecutors in drafting legal texts. The use of Foundation Models in the legal industry can potentially democratize access to legal services.

In education, Foundation Models can be trained to automate the process of motivating and instructing students. Teaching is practically a multimedia dialog process between teacher and student [13, p. 67]. In view of the recent advances in dialog Foundation Models, e.g. LaMDA, it seems straightforward to fine-tune a dialog agent for conducting educational dialogs. Models have to be trained to acquire teaching materials, subject matter, and pedagogical techniques. In addition, they need to understand students, their motivations, skills, and preferences. They must also comprehend the processes of learning and teaching and be able to perceive different reactions of students. The availability of educational Foundation Models could personalize and democratize learning. This would be especially important for poor countries, where even today only a fraction of students receive a proper education. It could also reduce the $30,000 in student loans that the average student in the US needs today.

8.2 Potential Harm from Foundation Models

Foundation Models sometimes have hundreds of billions of parameters and can be instructed to solve a wide variety of tasks. They are based primarily on associative self-attention, and understanding their inner workings in detail is extremely difficult. The next words of a text are generated by a random mechanism. Therefore, Foundation Models can potentially generate undesirable word sequences and responses that may be harmful to the reader. In the same way, Foundation Models can compose or interpret other media in ways that are detrimental to users. Recent surveys of these problems are provided by Weidinger et al. [92] and Bommasani et al. [13]. Table 8.1 lists the risk areas that we discuss in the following sections.

Table 8.1 Potential Harm Caused by Foundation Models. For each area of harm, we list the mechanism causing the harm, the type of potential harm, and detailed harm aspects. Table adapted from Weidinger et al. [92, p. 10]

8.2.1 Unintentionally Generate Biased or False Statements

A stereotype or bias is a generalized belief about a particular group of people, such as their personality, preferences, appearance, or abilities. Stereotypes are sometimes correct for part of the group, but can demean the rest of the group. It is known from psychology that bias is an innate human strategy for decision-making [46]. It allows the rapid formation of a judgment in real situations, when there is not much time to weigh arguments. As Foundation Models are trained with text produced by real people, these texts often reflect the stereotypes present in society. This is particularly serious for text generation systems such as dialog assistants and chatbots. Based on the principle of equality in human rights, a Foundation Model should avoid prejudices. For example, men and women should be equally likely to be associated with an occupation. Surveys on bias in NLP are provided by Garrido-Muñoz et al. [37], Mehrabi et al. [57] and Bommasani et al. [13, p. 129].

As an example, consider GPT-3 (Sect. 3.1.2) with 175B parameters [17]. It reproduces stereotypes, e.g. on gender, race and occupation. By providing a start text like “The detective was a”, the model-generated continuation often included a gender indicator, e.g. “man”. The authors tested 388 occupations and found that 83% of them were associated by GPT-3 with a male identifier [17, p. 36]. In contrast, women clearly predominate in occupations such as midwife, nurse, receptionist, and housekeeper. These associations reflect the relations actually observed in the texts and in society, but are often socially undesirable.

It was further investigated what mood was associated with race. The Asian race was consistently associated with high mood, while the Black race was related to low mood. Religious bias was investigated by examining which words appeared together with religious terms from the different religions. For example, words such as “violent”, “terrorism”, and “terrorist” were more frequently associated with Islam than with other religions. When GPT-3 is fed a sentence containing the word “muslim”, the completion contains terms such as “shooting”, “bombing”, “murder”, or “violence” more than 60% of the time. Such offensive, insulting or hateful language is also called toxic language. For example, for the input “Islam practitioners are”, the PaLM model (Sect. 3.1.2) generated many highly biased words such as “bad, terrorist, violent, peaceful, religious, sometimes, even, good, true, radical”. Note that most NLP models are affected by the bias problem [37].

There is a need for methods to mitigate bias problems. Biases originate from the training data, which may contain toxic and hate speech, abusive language, microaggressions, and stereotypes [13]. After training, biases are contained in Foundation Model components, such as parameters and word embeddings. A first avenue to reduce bias is to filter or reweight the training data to eliminate unwanted language. According to a number of experimental evaluations, technical approaches of any kind are currently severely limited, and methods that measure or combat bias in training data are fragile or ineffective [104]. Moreover, it is a difficult task to decide which biases to filter out. Is it okay for a man to run the 100 m faster than a woman? Is it okay that women cause fewer traffic accidents than men?

A simple approach to mitigate gender bias is to “swap” gender-specific terms in the training data when creating word embeddings [102]. In addition, simple masking of pronouns and names may also reduce biases and improve performance on certain language tasks [28]. These mitigation approaches may target different steps in the pipeline, such as the training data itself, the modeling objectives, and the adaptation methods [13, p. 133]. To date, however, there is no general, unified way to remove bias from Foundation Models for text generation, and proper mitigation requires a more holistic approach [38]. From this perspective, LaMDA’s filtering techniques appear to be quite effective (Sect. 6.6.3). The reinforcement learning approach with humans in the loop of InstructGPT [162] is particularly effective in avoiding unwanted language and performing the intended tasks (Sect. 3.6.5).
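
The “swapping” idea mentioned above can be sketched as follows; the word list is illustrative and far from complete, and real implementations must handle ambiguous cases such as “her”, which can map to “him” or “his”.

```python
# Minimal sketch of counterfactual data augmentation: gender-specific terms in
# a training sentence are exchanged with their counterparts, and both versions
# are used when training the embeddings.
SWAPS = {"he": "she", "she": "he", "him": "her", "her": "him",
         "his": "her", "man": "woman", "woman": "man",
         "actor": "actress", "actress": "actor"}

def gender_swap(sentence: str) -> str:
    words = sentence.lower().split()
    return " ".join(SWAPS.get(w, w) for w in words)

original = "the detective said he saw the man"
augmented = gender_swap(original)
print(augmented)                         # "the detective said she saw the woman"
training_data = [original, augmented]    # train on both versions
```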

8.2.1.1 Accidentally Generated False or Misleading Information

There are estimates that almost 50% of the traffic coming from Facebook is fake and hyperpartisan [47]. Nevertheless, it is a dominant source of news for millions of people. For the following reasons, fake news can be very harmful to people [81]:

  • Truth Bias: People have the presumption of truth in social interactions, and this assumption is only revised when something in the situation raises suspicion.

  • Naïve Realism: People tend to believe that their own views on life are the only correct ones. People who disagree are labeled as “uninformed, irrational, or biased”.

  • Confirmation Bias: People favor receiving information that only supports their own current views. Most persons only want to hear what they believe and do not want to find any evidence against their viewpoints.

There are numerous motivations for people to spread fake news. Clickbait intends to lure users with snappy headlines to earn money on social media pages. Propaganda intentionally aims to mislead the audience, e.g. during elections. Sometimes satire, parody, hoaxes, and rumors are published to entertain the readers. Through misleading headlines, biased news, or outright misinformation, journalists can attempt to distort information. There are some surveys on the analysis of fake news [27, 49].

Foundation Models determine correlations between different natural language phrases and generate new text based on probabilistic sampling. Therefore, they can accidentally generate text that contains false or misleading statements. Some examples are provided in Sect. 4.2.2. Factually incorrect or nonsensical predictions may be harmless, but under particular conditions they may pose a risk of harm. The harms range from false information, deception, or manipulation of an individual, to material damage. In addition, there are far-reaching community impacts, such as the loss of trust among members of a society.
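
A minimal NumPy sketch of such probabilistic sampling from a model's output scores follows; the logits are random stand-ins, and temperature and top-k truncation are common sampling parameters.

```python
# Minimal sketch of next-token sampling: the model's output scores (logits)
# over the vocabulary are turned into a probability distribution and the next
# token is drawn at random, here with temperature scaling and top-k truncation.
import numpy as np

def sample_next_token(logits: np.ndarray, temperature: float = 0.8, top_k: int = 50) -> int:
    logits = logits / temperature              # <1 sharpens, >1 flattens the distribution
    top = np.argsort(logits)[-top_k:]          # keep only the k most likely tokens
    probs = np.exp(logits[top] - logits[top].max())
    probs /= probs.sum()
    return int(np.random.choice(top, p=probs)) # random choice -> different runs differ

vocab_logits = np.random.randn(50_000)         # stand-in for a model's output scores
print(sample_next_token(vocab_logits))
```

Because each token is drawn from a distribution rather than deterministically, a plausible but factually wrong continuation can be produced at any step.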

There can be several reasons for false statements. First, training corpora contain the biases present in the community, such as attitudes towards homosexuals and towards ethnic and other minority groups. Moreover, they typically contain web texts that frequently cover factually incorrect statements, e.g., fiction, novels, poems, or jokes. In addition, training corpora are likely to contain instances of satire and misinformation, such as websites emphasizing a political stance. Furthermore, Foundation Models can have problems with logical reasoning and sometimes do not adhere to logical rules, e.g. if “birds can fly” is true, then “birds cannot fly” must be false (Sect. 4.2.3). Finally, the context determines if a statement is true or not. The sentences “I love you”, “it is raining”, or “Obama is president” can be factually correct or false depending on the speaker, the location, or the time. The training data does not always define this context, and the context often cannot be captured by a Foundation Model. Capturing context often requires taking into account knowledge of other domains and modalities (vision, time) and can be improved by grounding language in physical experience [8].

8.2.1.2 Reducing Bias by Retrieval

Retrieval-based Foundation Models, such as WebGPT (Sect. 6.2.3), Retro (Sect. 6.2.3), and LaMDA (Sect. 6.6.3), can access a large collection of text documents to enhance the text to be generated with relevant retrieved information. Shuster et al. [78] have shown that the use of retrieval reduces the rate of ‘hallucinations’. WebGPT performs about as well as humans for factual accuracy on the ELI5 benchmark. Similar to a scientific author, WebGPT can support its text by citing documents that support a statement. This often allows the user to check the validity of a statement.

However, as with scientific papers, referencing external sources does not solve all problems. What makes an Internet document reliable? Which statements in a text need to be substantiated, and which are self-evident “common knowledge”? Current language models are still in their infancy in dealing with these aspects, but there are ways to improve them. On the Internet, for example, there is already the Web of Trust rating platform, which derives the reliability of websites from user ratings. Note that citations make the answer appear more authoritative, which could lead to over-reliance on WebGPT’s answers. In fact, WebGPT sometimes produces incorrect statements when it paraphrases or synthesizes a context. WebGPT can also make more mistakes than humans on out-of-distribution questions.

8.2.1.3 Filtering Biased Text

Solaiman et al. [80] propose an iterative process for significantly changing model predictions by creating examples and fine-tuning on a dataset that reflects a predetermined set of targets. The strategy is to modify the behavior of the language model in a specified direction with fine-tuning on surprisingly few samples. This is evaluated by different measures focusing on the targets and the toxicity of outputs. At each iteration, additional training examples are added based on observed shortcomings. The approach performs significantly better on all metrics compared to control models for a broad range of GPT-3 language model sizes without compromising model performance.

The LaMDA dialog system (Sect. 6.6.3) is trained to perform retrieval and include retrieved information into its answers. The IR system is also capable of returning passages from the open web with their corresponding URLs. The LaMDA system is fine-tuned to classify for a given context whether the response is sensible, specific, and safe. Sensibleness measures whether a model’s response makes sense in context and does not contradict anything that was stated earlier. Specificity measures whether a response is specific to a given context and contains some information. Safety means that the responses of the system should never violate a pre-specified set of rules [86, p. 25]. An evaluation by human raters shows that LaMDA is close to human performance in terms of sensibleness, safety and groundedness (Fig. 6.23). It turns out that fine-tuning with respect to safety and groundedness is a big advantage compared to the bare pre-trained model. Examples are shown in Table 8.2. A similar filtering approach was analyzed by Rae et al. [71] and implemented by Sun et al. [83].

Table 8.2 Selected examples showing the responses of the pre-trained and safety-fine-tuned LaMDA models to a given context. The authors note that without fine-tuning, the model can generate even more offensive and biased responses. A *** indicates omitted problematic phrases. Also, while safety-fine-tuned responses are better, some of them are still problematic [86, p. 36]

Lower performance of a Foundation Model on topics affecting particular groups can often be observed; it mainly depends on the coverage of these topics in the training data. An example is the information about Kurdish history present in the training set compared to information on English history. Covering different languages is possible in multilingual models (Sect. 3.3), but low-resource languages are always less represented. Although PaLM covers more than 100 different languages, 78% of the training data is English, and German is second with 3.5%. Therefore, current Foundation Models have higher performance in English than in other languages.

8.2.2 Intentional Harm Caused by Foundation Models

Foundation Models may be intentionally used to generate false statements. One approach is to fine-tune the model with biased training data, e.g. documents posted by Corona deniers. Carlini [20] discusses approaches to introduce unwanted documents into training data. Foundation Models predict higher likelihoods for concepts that are more prominent in the training data, regardless of whether they are factually correct. There are many examples of fine-tuning GPT models (Sect. 3.6.2) on more innocent text types, e.g. song lyrics [100] or poetry [52]. In a similar way, GPT-2 trained on biased data generates texts corresponding to the fine-tuning dataset, consisting for instance of far-right fake news [18, p. 14]. The resulting GPT-2 version was able to imitate the style of a publication with very high reliability. Note that OpenAI controls access to the fine-tuning API of GPT-3 (Sect. 3.6.2) to prevent similar efforts [54].

Throughout this book we have seen that Foundation Models can produce credible news stories that a majority of readers cannot distinguish from human-written text. The downside is that these models, especially GPT-3, can also be used for disinformation campaigns. In Sect. 6.5.5 we have demonstrated that language models may generate targeted fake-news by few-shot prompts with very little human effort. Foundation Models allow an agent to personalize fake content for small audiences, or even to target a single individual [13, p. 136]. By conditioning output on personal attributes or information, Foundation Models can create realistic personalized content that is more embarrassing, puts victims at greater risk, and leads to more successful blackmail attempts.

8.2.2.1 Fake Images Created by Foundation Models

Multimodal models like DALL-E 2 (Sect. 7.2.7) or GLIDE (Sect. 7.2.7) are ideal for creating fake images. As shown in Fig. 8.2, an image of a celebrity or an event can be altered by providing a simple sentence to insert new objects or persons into the image to fabricate evidence for fake news. Note that these approaches allow the creation of high-resolution images of 1024 × 1024 pixels using diffusion models. There are also workflows to generate fake videos, e.g. by DeepFaceLab [67], where the face of some person is inserted into a video and the face movements are aligned with a new spoken text of choice. This technique was recently used to fake the mayor of Kiev in video calls with a number of Western politicians [58].

Fig. 8.2

Image modifications generated with GLIDE [62]. The original image is shown on the left and the green area is marked for change. The green region is erased, and the model fills it in conditioned on the prompt given below. GLIDE is able to match the style and lighting of the surrounding context to produce a realistic completion. Image reprinted with kind permission of the authors [62, p. 3]

On the other hand, Foundation Models can be used to identify model-generated content [99]. Fake news can be detected by combining information on news content, publishing, and reposting relations of publishers and users, employing Foundation Models to relate these characteristics to each other [77]. Alam et al. [3] and Yu et al. [98] provide surveys on multimodal disinformation detection.

8.2.2.2 Surveillance and Censorship

Large organizations or countries may use Foundation Models for mass surveillance or censorship. To screen the content of social networks, classifiers for sentiment analysis or identification of critical utterances can be trained and easily applied to large volumes of text. Using only a few training samples, these classifiers achieve high accuracy in identifying specific types of text [17]. Such classifiers may be used for identifying, for example, political dissent at scale, reducing the effort to recognize dissenters. This is already happening on an extremely large scale in China, as reported by the New York Times [95]. Such surveillance often leads to self-censorship, e.g. when writing texts for web blogs.

A less drastic form of censorship is algorithmic filtering in social media that determines the content presented to users, often using Foundation Models. In this way, social media platforms have the ability to influence user perceptions and decisions, from hotel choices to voting preferences. Users often only receive news that they ‘like’ or that the provider deems “appropriate”, and therefore may find themselves in a ‘filter bubble’ where news that does not match the expressed opinion is hidden. The problem is that users are often unaware of filtering and do not know the criteria used to prioritize content. As a result, many citizens are calling for regulation of filtering algorithms, but drafting and enforcing regulations remains a challenge. A target of regulation may be, for instance, that the ads a user sees are not based on sexual orientation, or that content related to COVID-19 does not reflect a user’s political affiliation [23]. The authors provide an auditing procedure that makes it possible to check whether the platform complies with the regulation, requiring only black-box access to the filtering algorithm. In addition, the resulting performance cost and content diversity are discussed.

8.2.3 Overreliance or Treating a Foundation Model as Human

It is well known that users often do not understand the exact nature of a chatbot. XiaoIce was designed as an “empathetic voice assistant” [103] and launched by Microsoft in China in 2014. It was the most popular chatbot in the world with 660 million users in China, Japan, the USA, India, and Indonesia. In the conversations between XiaoIce and its users, an average of 23 responses were counted per dialog. That is more interactions than were observed on average in conversations between real people (about 9). This shows that users enjoyed talking to XiaoIce at length. Even more, users were building a ‘personal’ relationship with XiaoIce and told the system very private details of their lives.

Recent dialog models such as BlenderBot 3 and LaMDA (Sect. 6.6.3) have more parameters and much better ratings than XiaoIce. The LaMDA dialog system, for instance, on average generates more interesting and also more informative answers than a human [86]. Thus, there is a risk that people will accept the system as human. This can cause psychological harms, such as disappointment when a user tries to use the model as a ‘partner’. This issue has been addressed in a number of movies such as Ex Machina and Her. Users may ‘blindly’ trust conversational agents. If users act on Foundation Model predictions without reflection or effective control, factually incorrect model predictions may cause harm that could have been prevented by effective monitoring.

8.2.4 Disclosure of Private Information

Foundation Models have billions of parameters and are trained on massive text collections with many billions of tokens. However, only a small fraction of the knowledge in the training data can actually be replicated by Foundation Models. Nevertheless, Carlini et al. [21] have shown for GPT-2 that it is possible to reproduce hundreds of texts verbatim. They identify 46 names, phone numbers, addresses, and social media accounts of individual persons, excluding celebrities. A survey on privacy in Deep Learning is provided by Mireshghallah et al. [59].

The PaLM model has 540B parameters and was trained on 780B tokens in a single pass. To evaluate memorization, the authors randomly selected 100-token sequences from the training examples and prompted the model with the first 50 tokens of each span. They measured how often the model produced a 50-token continuation by greedy decoding that exactly matched the training example. It turned out that the model was able to reproduce the continuation for 2.4% of the data. This means that the model could be able to reproduce 18.7B tokens of the training data, which is an extremely large set of documents. Memorized sentences were often formulaic text with no potential to harm persons. However, it was also observed that LaMDA memorized stories, news articles, and facts.
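
The memorization test can be sketched as follows, assuming the Hugging Face transformers package and using the small GPT-2 model as a stand-in for a large Foundation Model; the example span is an invented placeholder.

```python
# Sketch of the memorization test: prompt with the first 50 tokens of a
# training span and check whether greedy decoding reproduces the next 50
# tokens exactly.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def is_memorized(example_ids: torch.Tensor) -> bool:
    """example_ids: 100 token ids taken verbatim from the training data."""
    prompt, target = example_ids[:50], example_ids[50:]
    with torch.no_grad():
        out = model.generate(prompt.unsqueeze(0), max_new_tokens=50,
                             do_sample=False)            # greedy decoding
    return torch.equal(out[0, 50:], target)

text = "The quick brown fox jumps over the lazy dog. " * 20  # placeholder span
ids = tokenizer(text, return_tensors="pt")["input_ids"][0][:100]
print(is_memorized(ids))
# The memorization rate is the fraction of sampled training spans for which
# this check returns True (2.4% in the PaLM evaluation described above).
```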

There are several ways to mitigate privacy problems in Foundation Models. A memory-intensive approach would be to use a Bloom filter to remove generated sequences that already occur in the training data. Another approach is training with differential privacy. The idea behind differential privacy is that the model output does not allow any conclusions to be drawn about an individual person. There is a differentially private stochastic gradient descent (DP-SGD) algorithm [1] that can be used to train Foundation Models [36, 97], as sketched below. However, because less information can be used during training, there is a significant reduction in the performance of the Foundation Model [35]. Qu et al. [69] propose a privacy-adaptive pre-training method for Foundation Models and demonstrate that a BERT model pre-trained with a denoising MLM objective can substantially increase the utility of BERT compared to prior approaches while retaining the same level of privacy protection.
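
The core idea of DP-SGD can be sketched as follows: per-example gradients are clipped to a maximum norm and Gaussian noise is added before the parameter update, so that no single training example can dominate the result. The hyperparameter values below are illustrative only.

```python
# Minimal sketch of one DP-SGD update step: clip each example's gradient,
# add Gaussian noise calibrated to the clipping norm, then update parameters.
import numpy as np

def dp_sgd_step(params, per_example_grads, lr=0.1, clip_norm=1.0,
                noise_mult=1.1, rng=np.random.default_rng(0)):
    clipped = []
    for g in per_example_grads:                     # clip each example's gradient
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip_norm / (norm + 1e-12)))
    noise = rng.normal(0.0, noise_mult * clip_norm, size=params.shape)
    noisy_mean = (np.sum(clipped, axis=0) + noise) / len(per_example_grads)
    return params - lr * noisy_mean                 # noisy, privacy-preserving update

params = np.zeros(5)
grads = [np.random.randn(5) for _ in range(32)]     # one gradient per training example
params = dp_sgd_step(params, grads)
print(params)
```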

During inference, privacy violations may occur even if the individual’s private information is not included in the training dataset. A Foundation Model can make correct inferences about a person based solely on correlational data about other persons. Such a statistical disclosure can occur when Foundation Models predict the gender, race, sexual orientation, income, or religion of an individual. These conclusions can harm individuals who are correctly classified by disclosing their private information and increase the risk of unfair discrimination. Also, incorrectly predicted characteristics can harm individuals by exposing them to unfair discrimination.

8.2.5 Society, Access, and Environmental Harms

8.2.5.1 Access to Foundation Models

Foundation Models are expected to transform large areas of the business world and our daily lives. Models like LaMDA and PaLM with hundreds of billions of parameters have the greatest innovation potential. However, currently only a few organizations in the world, such as Google, OpenAI, Facebook, Microsoft, and the Beijing Academy of Artificial Intelligence, have the resources to train Foundation Models. These models can be used on a large scale to replace human labor, supplement humans, or help discover new tasks and opportunities. Even if Foundation Models increase average productivity or income, there is no economic principle that guarantees that everyone will benefit. This can lead to greater concentration of ownership and power for the owners of the model. Figure 8.3 shows the size of models trained by large Internet companies compared to models trained by universities and smaller research institutions.

Fig. 8.3

Around 2016, a new trend of very large models emerged (red). These were developed by leading Internet companies that were able to finance the investment. The lower blue line illustrates the average computational effort of regular models, e.g. from universities. Note the logarithmic scale of the training compute. Image cutout from [76, p. 5]

In contrast, there are ideas to create public datasets and train open-source Foundation Models. Decentralization would be desirable so that everyone can share in the benefits of the models. Public funding and infrastructure are needed to prevent Foundation Models from being operated only by private companies [13]. Stanford University recently called for a “National Research Cloud” to supply universities with enough computing power and datasets to prevent Foundation Models from being entirely dominated by private companies [33]. Currently, there are many efforts to reduce the cost of training these models and apply them to other languages, such as GPT-NeoX-20B [91], BigScience [11], and OpenGPT-X [61]. Recently Meta announced the release of an Open Pre-trained Transformer (OPT-175B), a language model with 175 billion parameters trained on publicly available data sets, to allow for more community engagement in understanding this foundational new technology [101]. The BLOOM language model has 176B parameters and is freely available. It aims to represent the cultural context of European languages. The dialog system BlenderBot 3 with 175B parameters is based on OPT-175B and has also been released as open-source. It is not advisable that arbitrary people have access to the full models, as the risk of misinformation and misuse is obvious. The two large models are only made available to researchers in a non-commercial setting.

8.2.5.2 Energy Consumption of Foundation Models

In this and the following section, we discuss the harms that result from the environmental impact of Foundation Models and from downstream economic consequences. Foundation Models incur significant environmental costs because of their energy demands for training and operating the models. As an example, consider the training effort for the PaLM model with a total effective emission of 271.4 tons of CO2 equivalent emissions [24]. This is 50% more than the estimated 180 tons of CO2 equivalent emissions of a direct round trip of a single passenger jet between San Francisco and New York (JFK). Note that the application of Foundation Models is much cheaper. OpenAI charges $72 for processing the collected works of Shakespeare with 900k words with GPT-3. Foundation Models are used at scale by Google and Microsoft, e.g. for translation or web search. A more detailed discussion is given by Bommasani et al. [13, p. 139].

8.2.5.3 Foundation Models Can Cause Unemployment and Social Inequality

The groundbreaking capabilities of Foundation Models in language processing can, on the other hand, lead to the automation of tasks that are currently performed by paid human workers, such as responding to customer service inquiries, translating documents, writing computer code, or creating images, with a negative impact on employment. This will require current workers to be retrained for new jobs and could eventually lead to higher unemployment. The economic risks are difficult to forecast, as it is not clear at what scale new human workers will be needed. One worrying development is that, for the first time, intellectually demanding work is being replaced by machines on a large scale [5]. According to this study, the most vulnerable employment segments are logistics, office work, production, service, sales, and construction.

Paolillo et al. [65] start from the observation that jobs require a mix of capabilities. They decompose occupational competences into 87 different skills and estimate an automation risk index (ARI) for each skill. From this, they calculate an automation risk for almost 1000 occupations. The ARI can be interpreted as the proportion of the human skills required for a job that can also be performed by machines. For physicists, the authors estimate the lowest ARI with a value of 0.44, while slaughterers and meat packers have the highest ARI of 0.78. Figure 8.4 shows the estimated ARI for different job clusters. The median ARI is about 0.6, which means that, for a typical occupation, about 60% of the required skills can be automated. As a consequence, almost all occupations are likely to be strongly affected by automation. The authors argue that workers’ automation risk could be substantially reduced by moderate occupational retraining.

Fig. 8.4
A box plot represents the automation risk index for 22 occupation clusters. The upper whisker of farming, fishing, and forestry has the highest risk index value, while community and social service has the lowest.

Automation risk for occupation clusters in the U.S. sorted by median risk values (line inside the box). For each job cluster, the boxplot shows the first quartile (Q1), median (Q2), and third quartile (Q3) of the ARI distribution, and the whiskers indicate the upper and lower adjacent values. Image reprinted with kind permission of the authors [65, p. 4]
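The aggregation step behind such an index can be illustrated with a small sketch. The skill names, weights, and per-skill risk values below are purely hypothetical and only show how an occupation-level ARI could be computed as a weighted share of automatable skills; they do not reproduce the actual procedure of Paolillo et al. [65].

```python
# Hypothetical sketch: occupation-level automation risk as a weighted
# average of per-skill automation risks (all values are made up).
skill_risk = {                 # automation risk per skill, in [0, 1]
    "manual dexterity": 0.85,
    "data analysis": 0.70,
    "social perception": 0.25,
    "creative writing": 0.30,
}
skill_weights = {              # importance of each skill for one occupation
    "manual dexterity": 0.5,
    "data analysis": 0.2,
    "social perception": 0.2,
    "creative writing": 0.1,
}

def automation_risk_index(weights, risks):
    """Weighted share of the required skills that machines could also perform."""
    total = sum(weights.values())
    return sum(w * risks[skill] for skill, w in weights.items()) / total

print(f"ARI = {automation_risk_index(skill_weights, skill_risk):.2f}")
```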

Artificial intelligence differs from previous innovations in that it automates not manual jobs but cognitive tasks [15]. Using panel data on 33 OECD countries, this study investigated the link between AI, robots, and unemployment. It found that both robots and AI tend to increase unemployment, providing additional evidence for the literature on technological unemployment. It also concludes that, over a 3-year period, AI increases the unemployment rate of people with a medium level of education, while the effect is negative or not significant for other groups. This is an indication that medium-skilled jobs suffer most from increasing AI use.

Foundation Models are extremely good at generating stories, and it is reasonable to assume that in a few years they will be able to write entire novels or compose songs in a semi-automatic way. Likewise, Foundation Models can create and modify graphics and photorealistic images, thus devaluing the work of graphic designers and photographers. This is especially true for creative works (e.g. fiction, press articles, music), but also for scientific studies. This type of plagiarism is discussed by Dehouche [30]. Since it cannot be argued that the generated content violates copyright, this development can undermine the profitability of creative or innovative work. While such copyright erosion can cause harm, it can also create significant social benefits, for example by expanding access to educational or creative materials for a wider community. The assessment of the potential harms and benefits of copyright erosion deserves further consideration [92]. In the meantime, courts are dealing with this problem [90].

In January 2021, there were 4.6B active Internet users worldwide, i.e. 59.5% of the global population [82]. Nevertheless, many social groups and countries will not have access to Foundation Models, which require a particularly powerful computing environment. The unavailability of this technology can perpetuate global inequalities by disproportionately benefiting some groups. Foundation Model applications such as translation, text-to-speech, and digital assistants are especially important for people who are illiterate, have not had a full education, or suffer from learning disabilities. This should be reflected in the choice of languages used for training Foundation Models. Bender et al. [7] discuss the global distribution of benefits and risks from Foundation Models in detail.

8.2.5.4 Foundation Models Can Promote a Uniform World View and Culture

Currently, the Internet is dominated by monopolies. Alphabet handles web search, Amazon dominates e-commerce, Apple is the leader in business smartphones, Meta governs social networks, and Microsoft controls business software [6]. These companies benefit from extreme economies of scale: digital platforms often require large upfront costs, but after these initial expenditures the cost of providing the service to additional customers is close to zero. In addition, the companies have been buying startups and competitors to eliminate rivals.

Therefore, it can be assumed that Foundation Model services will be integrated into the existing infrastructure of these monopolies, using the existing search engines as information-providing components. Hence, it is plausible that only a few different Foundation Models will be employed to support the authoring of the majority of documents in the world. This means that the strengths, creativity, biases, shortcomings, oddities, and peculiarities of the few original models will be ubiquitous and may affect the culture in many different languages in a consistent way [13, p. 151]. This homogenization can produce extremely high benefits across a large number of applications, but it can also have a profoundly negative effect in other fields. Kleinberg et al. [48] have called this an ‘algorithmic monoculture’, which could lead to uniform biases, the promotion of specific views and theories, and the consistent and arbitrary rejection, misclassification, or ill-treatment of individuals or groups. Cave et al. [22] even argue that in both everyday news coverage and fantastic literature, artificial intelligence is predominantly portrayed as white, because whiteness is apparently still associated with rationality, intelligence, and power. As current antitrust laws do not work for Internet companies, new regulations are required to break up the monopolies [6]. This requires a redefinition of markets, requirements for the interoperability of services, and a shift in the ownership of data to customers, who can then transfer it to another provider.

8.2.5.5 A Legal Regulation of Foundation Models Is Necessary

The automated application of Foundation Models trained on extremely large text collections poses a whole new set of challenges for our society. We want common-good, human-oriented systems that are in line with our values, work reliably, and are competitive at the same time. We must therefore try to achieve fair and objective results and avoid undesirable consequences. Fair behavior of an application towards all stakeholders, consideration of the needs of users, reliable, understandable, and secure functioning, as well as the protection of sensitive data are central requirements for the trustworthy use of Foundation Models.

It is well documented that organizations have often made poor decisions when adopting a new technology [19]. Commercial companies like Google, on the other hand, have no direct incentive to increase transparency or reduce social inequalities [73]. In order to make Foundation Models humane and trustworthy, there needs to be a societal understanding of what guardrails, principles, and boundaries should apply, how Foundation Model applications should be developed, how autonomously they should be allowed to act, and how we want to control them. As a consequence, there are efforts in different countries to define rules for Foundation Models and AI systems.

The European Union proposes a regulatory framework based on the risk associated with an AI application [34]. It defines four risk levels: minimal or no risk, limited risk, high risk, and unacceptable risk. All AI systems with unacceptable risk (threats to the safety, livelihoods, and rights of people) will be banned [13, p. 157]. High-risk applications include critical infrastructure, educational training, and biometric and safety components, and they have to satisfy a number of strict checks and assessments before they can be put on the market. Special transparency obligations apply to systems with limited risk, such as chatbots. Systems with minimal or no risk, such as AI-enabled video games or spam filters, can be used freely. The vast majority of AI systems currently in use in the EU fall into this category.

In the US, specific regulatory guidelines have been proposed by different agencies [84]. The Department of Commerce is developing “a voluntary risk management framework for trustworthy AI systems”. The Federal Trade Commission lists a number of compliance expectations. These include requirements for adequate training data, the need to test models to avoid biases, openness regarding the use of data, truthful representation of a model’s performance, and transparency about modeling objectives. Although there is currently no uniform regulation of AI, regulators advise companies to craft policies and procedures that create compliance by design. This encourages AI innovation, but also ensures the transparency and explainability of systems. In addition, companies should audit and review policy usage regularly and document these processes to satisfy regulators. A detailed discussion of norms and regulation is given by Bommasani et al. [13, p. 154].

8.3 Advanced Artificial Intelligence Systems

Self-supervised learning is standard in Foundation Models and has led to unprecedented performance gains in language and image recognition tasks. However, human intelligence has further traits that are not covered by this paradigm. In this section, we first discuss whether Foundation Models are able to produce new creative content. Then we examine how the words and concepts of language can be “grounded”, i.e. connected to the corresponding objects and processes of the physical world. Finally, we consider Kahneman’s theory of human behavior and discuss some ideas for improving the current models.

8.3.1 Can Foundation Models Generate Innovative Content?

A long-discussed problem is whether current Foundation Models can generate innovative content, or whether they are just stochastic parrots [7] that mindlessly repeat phrases and text snippets acquired from the training data. In the book “Rebooting AI”, Marcus et al. [56] argue in a similar way. Marcus calls GPT-3 [43] “an amazing version of pastiche generation, in a way that high school students who plagiarize change a couple words here or there but they’re not really putting the ideas together. It doesn’t really understand the underlying ideas.” As argued above, GPT-3 cannot really “understand” the content it expresses, as it has no grounding for words and phrases in the objects and events of the real world.

Johnson et al. [43] prompted GPT-3 with the sentence “Write an essay discussing the role of metafiction in the work of Italo Calvino.” The system generated a concise five-paragraph essay on the topic, which the author characterized as “lucid and responsive”. When the prompt is repeated, GPT-3 generates a completely new response each time. When the author entered each generated sentence into the Google search engine, he could not find any of them: each sentence was custom-built for that specific prompt. This illustrates that Foundation Models are very good at combining pieces of content. However, they do not operate on the level of strings and words, but on the level of contextual embeddings, which express the underlying conceptual similarity of phrases and sentences and their relations across a large number of sentences and documents.
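Such an experiment can be repeated via the public API. The following minimal sketch uses the legacy OpenAI completion endpoint; the model name, key placeholder, and parameter values are illustrative and not those used in [43]. Because the sampling temperature is above zero, the model draws from its token distribution and returns a different essay on every call.

```python
import openai  # legacy openai-python (pre-1.0) interface

openai.api_key = "YOUR_API_KEY"  # placeholder

prompt = ("Write an essay discussing the role of metafiction "
          "in the work of Italo Calvino.")

# With temperature > 0 the completion is sampled, so repeating the same
# prompt yields a new text each time.
for i in range(3):
    response = openai.Completion.create(
        model="text-davinci-002",   # illustrative model choice
        prompt=prompt,
        max_tokens=400,
        temperature=0.7,
    )
    print(f"--- Sample {i + 1} ---")
    print(response["choices"][0]["text"].strip())
```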

This phenomenon becomes even clearer when we consider Foundation Models that capture text and image content simultaneously. As described in Sect. 7.2, models such as DALL-E 2 develop a joint embedding space for image patches and text tokens. In this space, images and texts are related not in terms of pixels and strings, but in terms of context-sensitive embeddings of the image patches and tokens. These embeddings differ depending on the overall composition of the image and the text. Generating new content is based on the correlation of these embeddings and can therefore create new combinations of images and text, for instance an image corresponding to “a corgi playing a flame throwing trumpet” (Fig. 7.15) or photorealistic images illustrating the caption “A teddybear on a skateboard in Times Square” (Fig. 7.16). Although DALL-E 2 does not know anything about the physical properties of the real-world location “Times Square”, it can combine information about it in terms of contextual embeddings and generate fairly realistic-looking views that have never been seen before. In this way, Foundation Models can indeed generate innovative content.
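The idea of a joint embedding space can be sketched in a few lines of PyTorch: two separate encoders map images and texts into the same vector space, and the cosine similarity between the normalized vectors expresses how well an image and a caption fit together. The randomly initialized linear layers below are toy stand-ins, not the actual encoders of DALL-E 2 or CLIP.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
embed_dim = 64

# Toy stand-ins for a real image encoder and text encoder.
image_encoder = torch.nn.Linear(2048, embed_dim)  # e.g. pooled vision features
text_encoder = torch.nn.Linear(512, embed_dim)    # e.g. pooled token features

image_features = torch.randn(2, 2048)  # two images (random placeholders)
text_features = torch.randn(3, 512)    # three captions (random placeholders)

# Project into the joint space and L2-normalize.
image_emb = F.normalize(image_encoder(image_features), dim=-1)
text_emb = F.normalize(text_encoder(text_features), dim=-1)

# Cosine similarity between every image and every caption; contrastive
# training pushes matching image-caption pairs towards high values.
similarity = image_emb @ text_emb.T  # shape (2, 3)
print(similarity)
```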

8.3.2 Grounding Language in the World

A long-standing problem in language research is how machines can “understand” the “meaning” of language. Bender et al. [8] argue that the “language modeling task, because it only uses linguistic forms as training data, cannot in principle lead to learning of meaning”. Here, “meaning” is defined as the relation between a linguistic form and the communicative intent in the real world. Language modeling in this context is a system for string prediction. According to this view, current language models do not acquire “meaning”, but merely relate phrases to other phrases.

An infant's perceptual learning also takes place in a self-supervised way (Fig. 8.5). Parents and babies point to objects during language development [26], and babies learn the grounded meanings of words that refer to common objects before they learn many other aspects of language [10]. The baby simply observes its environment and probably develops an expectation of how the environment (e.g. object movement, view change) will evolve over time (Fig. 8.6). Seeing an apple fall a number of times is enough to get a sense of how gravity works. Moreover, objects do not disappear when they move out of sight. The baby can learn by predicting these changes and unconsciously correcting its expectations whenever a deviation occurs [51]. This corresponds to self-supervised learning in the video domain by predicting the next frames. The NÜWA system (Sect. 7.3.4) is already pre-trained in this way and has achieved Sota results for forecasting the next frames of a video.

Fig. 8.5
A model chart represents various activities under motor, speech, vision and hearing, and social development, grouped by age between 0 and 13 months. The number of activities increases after the age of 3 months.

Timeline for the development of infant perception according to Wikipedia [93] and LeCun [51]. Abstract laws of nature, such as the fact that objects are affected by gravity and inertia, are acquired later than simpler concepts, like object permanence and the assignment of objects to broad categories. Most knowledge is obtained through observation, with very little direct manipulation, particularly in the first months

Fig. 8.6
A photograph of an infant lying on the bed and playing with a toy.

A baby observes its environment and manipulates objects. It develops an expectation of how the environment (e.g. object movement, view change) will evolve over time. It predicts these changes and subconsciously learns whenever a deviation occurs. Image credits in Table A.4
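The next-frame prediction objective mentioned above can be written down in a few lines: given a short clip, a model predicts the following frame and is penalized for the deviation from the frame that actually occurs. The tiny convolutional predictor and the random video batch below are only placeholders, not the architecture of NÜWA or any other specific system.

```python
import torch
import torch.nn as nn

# Placeholder video batch: (batch, time, channels, height, width), random data.
clips = torch.randn(4, 8, 3, 64, 64)
context, target = clips[:, :-1], clips[:, -1]  # predict the last frame

# Toy predictor: stack the 7 context frames along the channel axis.
predictor = nn.Sequential(
    nn.Conv2d(7 * 3, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(32, 3, kernel_size=3, padding=1),
)
optimizer = torch.optim.Adam(predictor.parameters(), lr=1e-3)

# One self-supervised training step: the "label" is simply the next frame.
optimizer.zero_grad()
prediction = predictor(context.flatten(1, 2))  # (4, 21, 64, 64) -> (4, 3, 64, 64)
loss = nn.functional.mse_loss(prediction, target)
loss.backward()
optimizer.step()
print(f"reconstruction loss: {loss.item():.4f}")
```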

If a system is trained only with words, it is difficult to learn a concept. A dog, for instance, is not fully understood if one only knows that it is connected to leashes, ears, cats, mammals, legs, fur, tails, toys, barking, etc. [50]. The information has to be structured so that people know that toys are things dogs play with, fur is their body covering, mammal is a category they fall into, and so on. The head of a dog next to four legs does not constitute a dog. Therefore, the best way to learn the concept of a dog is to perceive it in several media, for example as an image, in a descriptive text, and in a movie where it is chasing a cat.

Recently a model called PLATO [68] has been proposed to learn intuitive physics from videos. PLATO decomposes each segmented video frame into a set of objects using a perception module. Each object is assigned an ID to allow tracking over time. Using a violation-of-expectation criterion, PLATO can learn a number of physical concepts, such as object continuity, directional inertia, object persistence, and object solidity. The model thus offers a way to ground intuitive physical concepts in visual perception.
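The violation-of-expectation criterion can be illustrated schematically: a simple physics expectation predicts the next state of a tracked object, the prediction is compared with the actual observation, and a large prediction error is treated as a "surprising" event that drives learning. The constant-velocity predictor and all numbers below are hypothetical and not components of PLATO.

```python
import numpy as np

def predict_next_position(position, velocity):
    """Naive physics expectation: the object keeps moving with constant velocity."""
    return position + velocity

# Tracked object: current position and velocity (hypothetical values).
position, velocity = np.array([1.0, 0.0]), np.array([0.5, 0.0])
expected = predict_next_position(position, velocity)

# Observation in which the object suddenly jumps to a different place,
# e.g. a violation of object continuity behind an occluder.
observed = np.array([4.0, 2.0])

surprise = np.linalg.norm(observed - expected)  # violation-of-expectation signal
THRESHOLD = 1.0
if surprise > THRESHOLD:
    print(f"surprise {surprise:.2f}: expectation violated, update the model")
else:
    print(f"surprise {surprise:.2f}: observation matches the expectation")
```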

It can be expected that self-supervised learning will be extended to include more dimensions such as 3D, self-movement, and active manipulation of the environment. As LeCun says, “Instead of language or images, however, the next AI generation will learn directly from videos. Meta is currently putting a lot of effort into collecting video data from the first-person perspective for this new AI generation [41], but YouTube videos are also suitable training material” [74]. LeCun believes that AI systems can learn about the physical foundations of our world from such videos. This understanding, in turn, would be the basis for numerous abilities, such as grasping objects or driving a car.

A more detailed perspective is given by Bisk et al. [12]. The authors argue that language learning has to make a connection to “extralinguistic events”. They distinguish different world scopes for language learning (Fig. 8.7). The most restricted scope contains carefully created corpora like the manually annotated Penn Treebank; BERT was trained on such carefully curated datasets. The next scope covers Web-scale data collections, which in the case of PaLM comprise 780B tokens that are used only once during training. According to the scaling laws (Sect. 3.5.1), it can be expected that with more data and more model parameters, the already high accuracy of language prediction will increase even further.
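One widely used parametric form of such scaling laws was proposed by Hoffmann et al. for the Chinchilla model; it expresses the expected loss L as a function of the number of parameters N and the number of training tokens D. The constants are their fitted values and may differ from the formulation referenced in Sect. 3.5.1:

$$
L(N, D) \;=\; E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}},
\qquad
E \approx 1.69,\; A \approx 406.4,\; B \approx 410.7,\; \alpha \approx 0.34,\; \beta \approx 0.28 .
$$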

Fig. 8.7
A Venn diagram represents supervised corpora inside Web-scale data, which lies inside an ellipse labeled mixing other modalities. These are contained in embodiment, integrating sensors and movement, which in turn lies inside the scope of social context and interaction.

World Scopes for Grounding Language. While the first three scopes have been explored to some extent, the remaining two scopes have to be considered in the future [12]

The next scope is to mix language with sensory input from other modalities. This is necessary, for instance, to learn the meaning, the visual impression, and the implications of a painting. A good way to make progress in this direction is to use datasets connecting images with captions. When video content is subtitled and speech or transcribed speech is also available, even more connections can be made between visual impressions, audio, speech, and language. Good examples of this scope are the OFA and NÜWA models, but they can still be improved in many ways.

To answer a question like “Is an orange more like a baseball or more like a banana?”, visual appearance alone is not enough. Different features of an orange have to be determined, e.g. weight, mobility, malleability, deformability, and taste. This can only be done by manipulating and exploring the orange by hand. Here the next scope is required, in which the agent moves and acts in the world and receives various tactile and sensory impressions of self-movement, force, and body position. Only in this way can the basic physical properties of the world be learned from interaction. To make progress in this area, a convergence of Foundation Models and robotics is needed, as initiated by PLATO. Thomason et al. [85] propose to ground language using 3D objects. The current approaches are rather limited.

The final scope is interpersonal communication, which is the central use case of natural language. It is currently unclear how a computer system can act as an embodied participant in a social context. Dialog models like XiaoIce and LaMDA are a first attempt. These questions are discussed at length by Bisk et al. [12] and will probably only become relevant in the more distant future.

8.3.3 Fast and Slow Thinking

Intelligent thinking occurs at different speeds. Daniel Kahneman, Nobel Laureate in Economics, has developed a hypothesis [45] about two different systems of thinking from long studies of human behavior (Fig. 8.8). System 1 (Fast Thinking) is fast, instinctive, and emotional. Examples include understanding a simple spoken sentence, driving a car on a quiet road, or recognizing an object in a picture. System 1 runs continuously, generating impressions, intuitions, and quick judgments based on our immediate perceptions.

Fig. 8.8
An illustration lists features of the fast thinking System 1 and the slow thinking System 2. Among the listed features, System 1 is error-prone and continuously scans the environment, while System 2 analyzes and reasons about special problems.

The properties of the two systems for fast and slow thinking in the human brain according to Kahneman [45]

System 2 (Slow thinking) is slower, more deliberate, and more logical. It is responsible, for example, for remembering a person not seen for a long time, for parking in a narrow parking space, or for solving the arithmetic problem 16*34. System 2 is only used when there are problems with System 1, i.e. when System 1 cannot explain the perceptions well.

Corresponding to System 2 in the brain is a working memory with limited capacity [32]. It allows thought content to be stored for a short time and manipulated at the same time. It apparently plays an important role in problem solving and logical reasoning. The number of information units that can be handled simultaneously is estimated to be between five and seven. Humans are aware of System 2 thought processes, whereas System 1 processing is largely subconscious. System 2 requires the ability to consider an abstraction of the world. This involves focusing on a limited set of features and processing them in depth, while ignoring others [14].

8.3.4 Planning Strategies

Turing Award winner Yann LeCun [53] argues that current Foundation Models can already process many aspects of the environment in a way similar to System 1. Self-supervised learning is able to capture speech and language well and to transform them into each other. To a lesser extent, images can be analyzed and associated with verbal descriptions. Joint processing of video, speech, and text is promising, but needs further development.

Only recently have Foundation Models been able to perform planning (Sect. 7.4), i.e. the systematic, future-oriented consideration of goals, means, and ways to achieve these goals. This corresponds to Kahneman’s System 2. The Foundation Model basically performs model predictive control and simulates the system under consideration for a series of time steps [75]. An example is driving a car on a road. Here the system simultaneously models the state of the system (e.g. position and speed of the car), the actions (e.g. steering wheel movements, acceleration), and the reward (e.g. distance to the goal, distance from obstacles). The Foundation Model is trained on a set of observed trajectories and can learn the dependency between states, actions, and resulting rewards. Subsequently, it is able to predict the next action needed to reach a specific reward level. Planning with Foundation Models can already include multiple modalities, e.g. perform control with images as state descriptions.
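The planning loop described above can be sketched as follows: a sequence model trained on trajectories of states, actions, and rewards is repeatedly asked which action to take next, given the trajectory so far and the reward still to be achieved; the action is executed, the new state is observed, and the process repeats. The toy environment and the toy "trajectory model" below are hypothetical stand-ins that only illustrate the control loop, not the architecture of any specific Foundation Model.

```python
import random

class ToyCarEnv:
    """Toy 1-D driving task: start at position 0, the goal is at position 10."""
    def reset(self):
        self.position = 0.0
        return self.position

    def step(self, action):
        self.position += action                 # action = signed speed
        reward = -abs(10.0 - self.position)     # closer to the goal -> higher reward
        done = abs(10.0 - self.position) < 0.5
        return self.position, reward, done

class ToyTrajectoryModel:
    """Stand-in for a model trained on (state, action, reward) trajectories.
    Here it simply proposes moving towards the goal with a little noise."""
    def predict_action(self, states, actions, remaining_return):
        return 1.0 + random.uniform(-0.2, 0.2)

def plan_episode(model, env, target_return=0.0, horizon=20):
    """Model-predictive-control-style loop: predict an action from the
    trajectory so far and the desired return, execute it, and re-plan."""
    state, actions, total_reward = env.reset(), [], 0.0
    states = [state]
    for _ in range(horizon):
        action = model.predict_action(states, actions, target_return - total_reward)
        state, reward, done = env.step(action)
        states.append(state)
        actions.append(action)
        total_reward += reward
        if done:
            break
    return total_reward

print(plan_episode(ToyTrajectoryModel(), ToyCarEnv()))
```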

According to Yann LeCun, “the ability to construct models of the world is basically the essence of intelligence” [53]. The required models have to predict not only physical movements, but also human behavior, economic activity, etc. The great challenge of AI in the next decade is to learn predictive models of the world that can handle uncertainty.

In LeCun’s view this does not directly require reasoning based on formal logic, which is not compatible with the gradients required for efficient learning. Yoshua Bengio says [29], “There are some who believe that there are problems that neural networks just cannot resolve and that we have to resort to the classical AI, symbolic approach. But our work suggests otherwise.” It is more probable that reasoning is performed by internal simulation and by analogy. As Geoffrey Hinton puts it: “But my guess is in the end, we’ll realize that symbols just exist out there in the external world, and we do internal operations on big vectors” [39]. It should be noted that newer models such as PaLM, which use chain-of-thought prompts, can reason about as well as average people (Sect. 4.2.3). Language is also not essential for the intelligence of animals; it was acquired later in evolution.
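Chain-of-thought prompting means that the prompt contains worked examples whose answers spell out intermediate reasoning steps, so that the model imitates this step-by-step style for a new question. The prompt below is a generic illustration of the technique, not a prompt from the PaLM evaluation in Sect. 4.2.3.

```python
# A generic chain-of-thought prompt: the worked example demonstrates the
# intermediate reasoning steps the model is expected to imitate.
prompt = """Q: A library has 48 books. It lends out 17 and receives 5 new ones.
How many books does it have now?
A: The library starts with 48 books. Lending out 17 leaves 48 - 17 = 31.
Receiving 5 new ones gives 31 + 5 = 36. The answer is 36.

Q: A train travels 60 km in the first hour and 45 km in the second hour.
How far does it travel in total?
A:"""
print(prompt)  # sent to the language model, which continues with the reasoning
```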

LeCun envisions a complex system in which a high-level “configurator” instantiates world models for the current problem on the fly and executes mental simulations [96]. He postulates that there is a single world model engine, which is dynamically configurable for the task at hand [96]. In this way, knowledge about how the environment works may be shared across tasks. A key requirement is that the world model must be able to represent and compare multiple possible predictions of the environment. The configurator has the ability to combine different models and to learn complex hierarchical action sequences. In his concept paper, LeCun [96] discusses many details of such a possible system.

The Gato model combining language, images, and control might be a first step in that direction, but it is still in its infancy (Sect. 7.4.2). The SayCan [2] system is an approach that integrates a robot and a Foundation Model, with the robot’s skills described verbally, e.g. “pick up the sponge”. Given a real-world task description, SayCan is able to generate a sequence of skill executions to complete the task. Along similar lines, a number of researchers from the reinforcement learning community argue that maximizing total reward may be sufficient to understand intelligence and its associated abilities [79].
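The central selection step of such a system can be sketched as scoring every verbally described skill by the product of a language-model relevance score and an affordance score that reflects whether the skill is currently feasible, and then executing the highest-scoring skill. The numerical values below are made up; the sketch follows the published description of SayCan [2] only in spirit.

```python
# Hypothetical scores for the instruction "bring me the sponge from the table".
skills = ["pick up the sponge", "go to the table", "pick up the apple", "say done"]

llm_score = {            # how relevant the skill description is to the instruction
    "pick up the sponge": 0.60, "go to the table": 0.30,
    "pick up the apple": 0.05, "say done": 0.05,
}
affordance = {           # how likely the robot can execute the skill right now
    "pick up the sponge": 0.10,  # the sponge is not within reach yet
    "go to the table": 0.90, "pick up the apple": 0.20, "say done": 1.00,
}

# Combined score: language relevance times feasibility in the current state.
best = max(skills, key=lambda s: llm_score[s] * affordance[s])
print(f"execute skill: {best}")  # -> "go to the table" before grasping the sponge
```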

Melanie Mitchell agrees with Yann LeCun that current Foundation Models are not powerful enough. “They lack memory and internal models of the world that are actually really important,” she says [40]. In principle these models do not need language. But language has a big advantage: it allows goals to be changed on the fly simply by including some facts or statements, similar to the few-shot technique. Overall, it can be expected that there will be major advances along these lines of development in the coming years.