14.1 Pre-trained Models: New Era of Representation Learning

It has been about two years since the publication of this book’s first edition. These two years have witnessed the astonishing rise of large-scale pre-trained models, also known as big models or foundation models. With the development of pre-trained modeling techniques, representation learning is exhibiting the following remarkable trends.

Unified Architecture of Representation Learning

Ever since the initiative of parallel distributed processing (PDP) in the 1980s, hundreds of neural network architectures have been proposed to address various data, tasks, and domains, with landmark architectures such as Hopfield networks [48], Boltzmann machines [2], self-organizing maps (SOM) [63], recurrent neural networks (RNN), convolutional neural networks (CNN) [70], long short-term memory (LSTM) [47], ResNet [45], and Transformers [118].

As neural network techniques evolve, some architectures fade away and others emerge. At the early stage of the deep learning era, there were still many architectures specifically designed with different characteristics. For example, at that time, we had to conduct experiments to find which architectures were more suitable for a given NLP task, among CNNs, GRUs, LSTMs, and their variants. After the Transformer architecture was proposed in 2017, and especially after the pre-trained language models BERT and GPT were built with Transformers as the backbone, the optimal neural architectures across various domains and tasks have become more and more unified from formerly diverse schemes, as shown in Chap. 5. Transformers have become the most widely used backbone across almost all NLP tasks, ranging from natural language understanding to generation.

The unifying trend also extends across multiple modalities: Transformers have shown their power beyond NLP, in CV as shown in Chap. 7, and on other data such as biomedical structures as shown in Chap. 12. A unified architecture across multiple modalities will help model the rich knowledge of cross-modal interaction and further facilitate learning from heterogeneous data.

Of course, the unification process is not complete, and there is no evidence that Transformers will be the ultimate neural architecture for representation learning. This remains an important research topic for the future.

Unified Model Capability for Multiple Tasks

With the “pre-training-fine-tuning” pipeline, pre-trained language models also build unified model capability from large-scale unlabeled corpora for multiple downstream tasks. The unified capability of pre-trained models becomes more significant as model parameters grow to the billion scale with more data and more computation power.

The evidence is their unprecedented power in zero-shot and few-shot learning, as shown in Chap. 5. For example, with parameter-efficient delta tuning, we can adapt big models to specific complicated tasks using no more than 1% additional parameters. This makes us conjecture that big models may have learned all essential knowledge during pre-training on large-scale corpora, and that the function of delta tuning is only to inform big models which internal knowledge should be stimulated for specific downstream tasks.

The unified model capability revealed by big pre-trained models makes them completely different from conventional machine learning approaches including statistical learning and deep learning. It requires the exploration of a new theoretical foundation and efficient optimization methods conditioned on pre-trained big models.

Moreover, given the above-mentioned characteristics of unified model architecture and model capability, we believe pre-trained models to some extent indicate the maturity of distributed representation learning for AI, with great potential for extensive use in every area requiring AI assistance. This will open a new era of AI and NLP, from research to application. Standing on the shoulders of these new giants, big pre-trained models, there are also many new challenges and opportunities for representation learning. Here we summarize ten key open problems for pre-trained models and hope more efforts will be devoted to these problems, promoting wide application of big model techniques.

14.2 Ten Key Problems of Pre-trained Models

In this section, we summarize ten key open problems of pre-trained models, including theoretical foundation, next-generation architecture, high-performance computing, parameter-efficient delta tuning, controllable generation, safety and ethics, cross-modality, cognitive learning, innovative applications, and big model systems.

Note that these open problems are raised based on our research experience with pre-trained models and deep learning. This does not imply that problems beyond these ten are unimportant or less important.

14.2.1 P1: Theoretical Foundation of Pre-trained Models

As pre-trained models (PTMs) [9, 42] become the infrastructure of modern NLP, the theoretical principles behind them become exceedingly intriguing to the community. Self-contained and rigorous mathematical theories could effectively guide improvements to neural structures, pre-training objectives, and adaptations of PTMs and even pave the road to more powerful artificial intelligence. However, the sad truth is that we are still far from a complete understanding of PTMs. Their mechanism intersects with deep neural networks, transfer learning, and self-supervised learning in an intricate way, and moreover, considerable empirical evidence suggests that the potential of PTMs has not been fully explored.

The specialty of PTMs lies in the universal generalization capability they exhibit when adapted to various tasks. Constructed on the basis of deep neural networks (typically deep Transformers [118]), PTMs are first pre-trained on massive unsupervised corpora and then adapted to particular downstream tasks. After optimizing a general language modeling objective in the pre-training phase, PTMs are able to yield tremendous generalization capability on a wide range of NLP tasks that involve language data, even with a few examples and a small amount of optimization [30, 50, 71].

In this subsection, we hold the mindset of seekers and discuss the theoretical foundation of the miraculous generalization capability of PTMs by decomposing it into several sub-questions.

What Is the Appropriate Mathematical Description of the Generalization Capability?

When dealing with machine learning and deep learning models, calculus, linear algebra, and probability theory are among the most common tools, while more advanced (and complicated) mathematics remains almost untouched at the current stage. This may limit our understanding because the real linear and nonlinear operations in the representation space are difficult to describe with these tools. Some argue that the probability theory framework widely used to describe generative models becomes intractable when it comes to capturing the correlations of high-dimensional variables [96]. Under this circumstance, other mathematical tools need to be adopted and evaluated to interpret the utilities of neural networks and even PTMs [43, 127]. For example, recent progress in geometric deep learning [11] elaborates different types of neural networks through the lens of symmetry and invariance, bringing new inspiration to the community. There are also works that attempt to provide mathematical frameworks for the revolutionary trigger point, i.e., the Transformer model [34]. Nevertheless, merely attempting to elucidate the neural network architecture may still be insufficient to understand PTMs, and grasping the relationship between pre-training and adaptation is crucial as well.

Why Does Pre-training Bring the Generalization to Downstream Tasks?

Compared to traditional deep learning, the most obvious difference, and the key to success, is the extensive pre-training phase over massive data. The simplicity of the pre-training task and the effortlessness of the adaptation to complex tasks urge us to wonder how pre-training and adaptation are related in principle. From a vague point of view, PTMs’ colossal capacity makes it possible to induce a type of general knowledge, while adaptation is a process that exposes such knowledge [142]. This is, of course, an incomplete and unverifiable explanation, but a series of delta tuning [30] efforts implicitly guided by this insight has yielded remarkable results in a parameter-efficient manner. Taking a closer and simplified look, such knowledge can be modeled as coherence structures in a latent space under the Bayesian framework [128, 134]. Switching to another pragmatic perspective, analyzing the loss landscape may bring new insights into the relationship between pre-training and adaptation [73], where the pre-training phase produces a readily optimizable initialization landscape for PTMs surrounded by local optimums. Modern supervised learning theory aims to bound the theoretical adaptation loss via the empirical adaptation loss and generalization errors, and studies of self-supervised learning borrow from this progress to bound the adaptation loss with the pre-training loss under certain preconditions [4, 43, 116, 134]. Although the analysis of pre-training and adaptation could move our understanding of PTMs one step forward, the special capabilities that come with model scaling take the ultimate goal even further.
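
As a rough illustration of the kind of statement such theories pursue, a schematic and deliberately simplified bound of this flavor can be written as follows; the notation is ours for illustration and is not taken from any specific cited result:

\[ \mathcal{L}_{\text{task}}(\theta) \;\le\; \hat{\mathcal{L}}_{\text{task}}(\theta) \;+\; \mathcal{O}\!\left(\sqrt{\tfrac{\mathrm{Complexity}(\Theta)}{n}}\right), \qquad \hat{\mathcal{L}}_{\text{task}}(\theta_{\text{pre}}) \;\le\; f\bigl(\mathcal{L}_{\text{pre}}(\theta_{\text{pre}})\bigr) \text{ under suitable preconditions,} \]

where \(\mathcal{L}_{\text{task}}\) and \(\hat{\mathcal{L}}_{\text{task}}\) denote the population and empirical adaptation losses over \(n\) downstream examples, \(\mathrm{Complexity}(\Theta)\) measures the effective capacity of the hypothesis class, and \(f\) is a task-dependent function linking the pre-training loss \(\mathcal{L}_{\text{pre}}\) to the adaptation loss. The open question is precisely which forms of \(\mathrm{Complexity}\) and \(f\) actually hold for large PTMs.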

How Are the Model Capacity and Capabilities Related?

One of the most fascinating empirical observations of PTMs is the power expressed by merely scaling up their size. It is not just a matter of accuracy on standard classification or generation tasks; large PTMs counter-intuitively exhibit unprecedented emergent capabilities as the number of parameters increases. Models with tens of billions of parameters can give surprising adaptation performance with only a small number of trainable parameters prepended to the input layer [71]. GPT-3 [13], a model with 175 billion parameters, shows an extraordinary capability of in-context learning, which uses several examples to stimulate the model to imitatively make predictions without tuning a single parameter. Large-scale models can even directly learn from tokenized behaviors of humans to carry out complex tasks such as using search engines [89] and playing sandbox games [7]. Experimental studies indicate that special capabilities of large models do not accumulate linearly but emerge at a certain point [129]. Although such power of scale has been verified under different scenarios, it is still hardly framed theoretically.
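
To make the in-context learning setup concrete, the following minimal Python sketch builds a few-shot prompt for a made-up sentiment task; the task, the demonstrations, and the label words are illustrative assumptions, and the resulting string would simply be fed to a frozen PTM’s text-completion interface without any parameter update.

def build_icl_prompt(demonstrations, query):
    """Concatenate labeled demonstrations followed by the unlabeled query."""
    lines = []
    for text, label in demonstrations:
        lines.append(f"Review: {text}\nSentiment: {label}\n")
    lines.append(f"Review: {query}\nSentiment:")
    return "\n".join(lines)

demos = [
    ("The movie was a delight from start to finish.", "positive"),
    ("I walked out halfway through, it was that dull.", "negative"),
]
prompt = build_icl_prompt(demos, "A thoughtful and beautifully shot film.")
print(prompt)  # the model's continuation (e.g., " positive") is read off as the prediction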

The success of PTMs can be simultaneously attributed to data, objectives, and neural architectures, and it seems difficult to separate the modules of the process and study them independently without interfering with each other. Overall, the exploration of the theoretical foundation of PTMs is a necessarily arduous journey, whereas any promising conclusions could have profound influences. We encourage the readers of our book to keep an open mind and attempt to apply theoretical tools beyond NLP, machine learning, and even computer science to analyze the behaviors of PTMs and develop corresponding frameworks.

14.2.2 P2: Next-Generation Model Architecture

It has already been 5 years since Transformer was first released. Its high capability and ease of parallelism have enabled Transformer-based models to scale up efficiently and achieve near-human or even beyond-human performance on numerous tasks. During these 5 years, we have witnessed the boom of Transformer-based PTMs and the realization of more and more previously unimaginable goals, one by one [13, 89]. We have also witnessed the spread of Transformer’s territory from NLP to other fields such as computer vision [32, 80], robotics [39, 106], etc. Undoubtedly, Transformer is one of the most revolutionary model architectures in the history of deep learning.

Despite the power of Transformer-based models, as we have introduced in the first key problem, there is still no sound theory able to elucidate the mechanism of Transformer. Besides, Transformer is a data-hungry and resource-intensive architecture, and the problem is further exacerbated as the model size increases [1]. Though Transformer is an epoch-making architecture, we still believe that it will not be the ultimate form of neural networks. A natural question we would like to ask is: what could be the next-generation architecture for neural networks?

From a historical perspective, we find that many of the earlier breakthroughs in neural networks were inspired by other disciplines. For example, the convolution in CNNs is borrowed from research on the receptive field in cats’ visual cortex [53], and the memory in LSTMs is also designed to mimic mechanisms of the human brain. Therefore, in this subsection, we stand at the intersection of different disciplines and focus on neural network architectures that are inspired by other fields. Specifically, we introduce some architectures inspired by dynamical systems, geometry, and neuroscience. While these architectures may not yet significantly outperform Transformer, they all have their own potential and strengths that are worth paying attention to.

Dynamical Systems Inspired Architectures

A dynamical system is a system whose state evolves over time, e.g., the random motion of particles, where the location of each particle changes over time. Looking at the propagation of hidden states between different layers of a deep neural network, it is intuitive to associate it with a discrete dynamical system by interpreting the layer depth as the time step. Indeed, many works have drawn the connection between deep neural networks and discrete dynamical systems described by ordinary differential equations (ODEs) [82, 132]. The hidden state propagation in ResNet [45] exactly resembles the forward Euler discretization of an ODE. Therefore, the computation in ResNet can be seen as implicitly solving an ODE defined by the model parameters. Apart from dynamical systems described by ODEs, dynamical systems described by controlled differential equations (CDEs) [17, 101] and stochastic differential equations (SDEs) [61, 76] have also been shown to be closely related to neural networks.
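
To make this correspondence concrete, the following minimal PyTorch sketch treats a stack of residual blocks as forward Euler steps of an ODE; the network defining the vector field and all sizes are illustrative assumptions rather than any particular published model.

import torch
import torch.nn as nn

class ODEFunc(nn.Module):
    """The vector field f(h) that defines the continuous dynamics dh/dt = f(h)."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, dim))

    def forward(self, h):
        return self.net(h)

def euler_integrate(func, h0, num_steps, step_size=1.0):
    """Forward Euler discretization: each step h <- h + s * f(h) is one 'residual block'."""
    h = h0
    for _ in range(num_steps):
        h = h + step_size * func(h)
    return h

h0 = torch.randn(8, 32)                        # a batch of hidden states
hT = euler_integrate(ODEFunc(32), h0, num_steps=6)
print(hT.shape)                                # torch.Size([8, 32])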

A number of advantages stem from the dynamical system perspective of neural networks. Examples are as follows:

  1. GPU memory efficiency. By introducing the adjoint state method [15] in the numerical optimization problem, the GPU memory consumption can be reduced from \(\mathcal {O}(L)\) for ResNet (L denotes the number of layers) to \(\mathcal {O}(1)\) [20, 76].

  2. Adaptive computational time. Ideally, models should spend less time on simple samples and more time on complex ones. However, current architectures treat the instances with different complexity equally. By leveraging the adaptive step-size solvers in numerical optimization literature, models can have adaptive time costs for different instances [18, 37].

Through the perspective of dynamical systems, neural networks can be naturally generalized to continuous systems, and the plentiful theories of dynamical systems can step in to inspire new designs for neural networks. We believe this is a promising area to explore.

Geometry Inspired Architectures

Humans live in a Euclidean world. Therefore, we naturally assume that the geometry of neural networks should also be Euclidean. However, this is not necessarily the case, as the data that neural networks handle differ from what we are exposed to. Many complex data, such as graph data, have been shown to exhibit non-Euclidean properties [12]. Intuitively, when neural networks are also non-Euclidean, they should be able to handle such data better because the geometries match.

Considering non-Euclidean geometries in neural networks brings several benefits: (1) Greater capability in modeling structured features, both theoretically and empirically. Many real-life graphs are known to be tree-like. However, even when the dimension of the Euclidean space is unbounded, tree structures still cannot be embedded with arbitrarily low distortion, i.e., some information will always be lost. In contrast, this can be easily achieved in a two-dimensional hyperbolic space, which is a non-Euclidean space [102]. In practice, many graph-related works have demonstrated the effectiveness of low-dimensional hyperbolic models [16, 22, 91]. (2) Combinability with the dynamical system perspective. Geometry can also collaborate with the dynamical system perspective mentioned above. From the perspective of geometry, the layers in neural networks can be seen as transformations on the coordinate representation of the data manifold. From the perspective of dynamical systems, the depth of neural networks can be continuous. When combined, it is possible to obtain a continuous transformation process from the data manifold to the final linearly separable manifold for different classes [12, 84]. This has the potential to provide a more intuitive understanding of how neural networks gradually transform the data from input features to features that can eventually be used for classification.
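
As a small illustration of the first benefit, the following NumPy sketch computes distances in the Poincaré ball, a standard model of hyperbolic space used by the hyperbolic embedding works cited above; the particular points and dimensions are made up for illustration.

import numpy as np

def poincare_distance(u, v, eps=1e-9):
    """Geodesic distance between two points inside the unit Poincare ball."""
    sq_u = np.sum(u * u)
    sq_v = np.sum(v * v)
    sq_diff = np.sum((u - v) ** 2)
    x = 1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v) + eps)
    return np.arccosh(x)

root = np.array([0.0, 0.0])           # a node near the origin acts like a tree root
leaf_a = np.array([0.95, 0.0])        # nodes near the boundary act like deep leaves
leaf_b = np.array([0.0, 0.95])
print(poincare_distance(root, leaf_a))    # ~3.7: root-to-leaf distance
print(poincare_distance(leaf_a, leaf_b))  # ~6.6: leaf-to-leaf distance, much larger

Distances grow rapidly toward the boundary of the ball, which is exactly the property that lets tree structures be embedded with low distortion in only two dimensions.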

In all, non-Euclidean geometry offers a prominent direction for neural networks. It is a promising approach for handling structured data and can be combined with other perspectives to offer better insight into neural networks.

Neuroscience Inspired Architectures

When thinking, unlike neural networks, we do not need to consume large amounts of energy, nor does our brain temperature spike to nearly 100 °C. Although still called neural networks, today’s artificial neural networks (ANNs) have become far more energy-hungry and resource-demanding than the human nervous system. The sparsity of the human brain allows it to consume much less energy than ANNs. Therefore, inspired by the sparsity of neuronal interconnections in the human brain, researchers have experimented with designing neural networks with sparsity along two dimensions: spatial sparsity and temporal sparsity.

The human brain has sparse neuronal connections and relatively distinct functional partitions. That is, neuronal connections in the human brain are spatially sparse. This allows us to accomplish a simple task without using neurons from the whole brain. Inspired by this spatial sparsity, the mixture-of-experts (MoE) structure has been proposed [33, 54]. Unlike conventional neural networks, which are densely connected, MoE divides each layer into several experts and additionally includes a router to route every input to only a few experts. Since not all experts are involved in the computation, inference can be faster than in densely connected networks. The advantage of MoE models in terms of computational cost allows them to scale up very efficiently. In addition, because different inputs are processed by different experts, ideally, different experts can learn to handle different aspects of a task (or even multiple tasks), making MoE suitable for artificial general intelligence. Indeed, MoE models have been shown to reach the state of the art on several benchmarks with less computational cost [88].
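
The following minimal PyTorch sketch shows the routing idea: a router scores the experts for each token, and only the top-k experts are executed, so per-token compute stays roughly constant as experts are added. The expert architecture, sizes, and top-k choice are illustrative assumptions, and real systems add load-balancing losses and batched expert dispatch that are omitted here.

import torch
import torch.nn as nn

class SparseMoELayer(nn.Module):
    def __init__(self, dim, num_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
             for _ in range(num_experts)]
        )
        self.router = nn.Linear(dim, num_experts)
        self.top_k = top_k

    def forward(self, x):                          # x: (num_tokens, dim)
        scores = self.router(x)                    # (num_tokens, num_experts)
        weights, indices = scores.topk(self.top_k, dim=-1)
        weights = weights.softmax(dim=-1)          # mixture weights over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):             # run only the selected experts per token
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
        return out

tokens = torch.randn(16, 64)
print(SparseMoELayer(64)(tokens).shape)            # torch.Size([16, 64])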

In addition to spatial sparsity, the human brain also exhibits temporal sparsity, i.e., neurons do not transmit signals at every time step. Spiking neural networks (SNNs) [38] mimic the behavior of information propagation between neurons interconnected by synapses. When the pre-synaptic neuron is activated, it sends a signal in the form of synaptic current to the post-synaptic neuron, and the current strength is proportional to the weight of the synapse. The incoming synaptic currents change the membrane potential of the post-synaptic neuron, and when the membrane potential reaches a certain threshold, the post-synaptic neuron emits a spike, and its membrane potential is reset to its resting potential. The biggest advantage of SNNs is their extremely low energy consumption: because SNNs only consume energy when emitting spikes, their energy consumption can be extremely low compared with mainstream neural networks [60, 87, 112]. Also, neuromorphic chips, the specialized hardware for SNNs, allow both computation and parameter storage on the same chip, further boosting efficiency [26]. Although the performance of SNNs is often slightly lower than that of mainstream neural networks on datasets such as MNIST [70] and CIFAR-10 [66], the low-energy characteristic of SNNs makes them promising for the future.
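
The membrane dynamics described above can be summarized by a leaky integrate-and-fire neuron, sketched below in plain NumPy; the decay constant, threshold, and input current are illustrative values rather than parameters of any specific SNN model.

import numpy as np

def lif_neuron(input_current, threshold=1.0, decay=0.9, v_rest=0.0):
    """Simulate one leaky integrate-and-fire neuron; return its binary spike train."""
    v = v_rest
    spikes = []
    for i_t in input_current:
        v = decay * v + i_t        # leaky integration of the incoming synaptic current
        if v >= threshold:         # fire once the membrane potential crosses the threshold...
            spikes.append(1)
            v = v_rest             # ...and reset to the resting potential
        else:
            spikes.append(0)
    return np.array(spikes)

rng = np.random.default_rng(0)
current = rng.uniform(0.0, 0.4, size=50)   # weak random input current
print(lif_neuron(current))                 # a sparse 0/1 spike train; energy is spent only on the 1s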

Looking back at history, AlexNet [67] was proposed in 2012, and since then deep neural networks such as CNNs and RNNs have taken the lead in machine learning. Five years later, in 2017, Transformer was introduced and gradually replaced models such as RNNs. Now, in 2022, another 5-year period has passed, and we wonder what the next-generation neural network could be. We believe Transformer will not be the ultimate form of neural networks, and we are eager to see more researchers think about and explore next-generation neural network architectures and propose more economical, more efficient, and more effective models.

14.2.3 P3: High-Performance Computing of Big Models

The numerous parameters of big models come with exceedingly expensive computation and storage costs, imposing substantial challenges on both training and inference. In fact, improving the computational efficiency of big models is a complicated process in which many fundamental aspects should be considered. In particular, improvements across the computational infrastructure, algorithms, and specific applications can be pursued simultaneously. In this subsection, we discuss high-performance computing of big models from these three perspectives.

High-Performance Computational Infrastructure

We collectively refer to the hardware and software as the computational infrastructure, which is the foundation for both the training and inference of big models and even deep neural networks in general. High-performance computational infrastructure can be further exploited in the following directions: (1) Parallel computing methods, including data parallelism [113], tensor parallelism [52, 90], pipeline parallelism [104], and hybrid parallelism [95], can fully utilize distributed computing capabilities to accelerate the computation of big models. (2) We should take advantage of heterogeneous computing devices [56], including multi-level computing devices consisting of GPUs and CPUs and multi-level storage devices consisting of VRAM, RAM, and disks, to reduce the computing cost while maintaining computing efficiency. (3) Considering that big models have large-scale parameters, we should investigate techniques to reduce the memory overhead, including tensor offloading [100, 107] and tensor rematerialization [21, 62], enabling us to compute bigger models using fewer computing devices. (4) Moreover, high-performance tensor programs [122] are also critical for deploying big models efficiently, especially sparse tensor programs [149] given the sparsity of neural networks.
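
As one concrete instance of tensor rematerialization, the following sketch wraps each block of a toy model with PyTorch’s built-in gradient checkpointing utility (assuming a reasonably recent PyTorch version): activations inside a checkpointed block are discarded during the forward pass and recomputed during the backward pass, trading extra compute for a smaller activation memory footprint. The model itself is an illustrative stand-in.

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

blocks = nn.ModuleList([nn.Sequential(nn.Linear(1024, 1024), nn.GELU()) for _ in range(12)])

def forward_with_rematerialization(x):
    for block in blocks:
        # Intermediate activations inside `block` are not stored; they are recomputed on backward.
        x = checkpoint(block, x, use_reentrant=False)
    return x

x = torch.randn(32, 1024, requires_grad=True)
loss = forward_with_rematerialization(x).sum()
loss.backward()            # each block's forward is re-run just before its backward
print(x.grad.shape)        # torch.Size([32, 1024])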

High-Performance Algorithms

Existing work on big models enjoys the emergent ability that comes with increasing parameters while largely ignoring the efficiency of parameter utilization. If we draw an analogy between big models and the brain, we will find that the brain incurs a much lower cost for a similar number of billions of parameters (neurons), thanks to some enigmatic mechanisms brought about by evolution. Recently, two Turing Award winners, Yoshua Bengio and Yann LeCun, have also highlighted the importance of neuroscience for AI [140], believing that the next generation of AI will be largely driven by neuroscience. Hence, it is promising to design new algorithms by utilizing knowledge from neuroscience. We discuss several important brain-inspired mechanisms as examples and hope these methods can inspire more explorations. (1) Learning from the memory mechanisms of human brains [115]. We should build an explicit memory system to store information and retrieve relevant pieces for a given input instead of computing all parameters [40, 72]. (2) Learning from System 1 and System 2 of human brains [25]. We should design a system that can automatically switch between the fast and the accurate modes for inputs with different levels of complexity [135]. (3) Inspired by recent work highlighting the importance of cooperation between brain regions [114], we should also explore how to compose multiple big models to achieve better performance [3], which is more efficient than training a bigger model from scratch.

High-Performance Application

When dealing with the limited resources of edge devices such as mobile phones, our approach should shift from squeezing performance out of computing devices to compressing the big models themselves for efficient deployment. As introduced in Chap. 5, there are many compression techniques, such as knowledge distillation [46] and parameter pruning [41], that can compress big models to acceptable scales. Overall, in terms of high-performance applications, we believe the following future directions show considerable potential. (1) Computing hardware sets boundaries for our compression techniques. Therefore, the properties of the target hardware must be considered to find the best compressed architecture with minimal latency [121] or energy cost [125], rather than minimal FLOPs. (2) Different downstream tasks may exhibit different characteristics, thereby requiring compression strategies with disparate focuses. We should explore task-aware compression to utilize the specific patterns of different tasks, such as vocabulary reconstruction [136] for tasks in a specific language and decoder-oriented compression for generation tasks [77]. (3) Many compression approaches achieve similar results but are orthogonal in technical aspects. Therefore, we could combine multiple compression techniques to achieve higher compression ratios. Some preliminary works have begun to investigate combinational compression and have already achieved promising results [148]. However, how to combine all existing methods to achieve optimal inference acceleration within an acceptable performance degradation remains an open problem.
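
To make the distillation idea concrete, the following PyTorch sketch shows the common temperature-softened objective in which a small student matches the teacher’s output distribution in addition to the hard labels; the temperature, loss weight, and random logits are illustrative, and the exact objective varies across the methods cited above.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    # Soft targets: KL divergence between the softened teacher and student distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Hard targets: standard cross-entropy against the gold labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

student = torch.randn(4, 10, requires_grad=True)   # student logits: 4 examples, 10 classes
teacher = torch.randn(4, 10)                       # teacher logits (frozen, no gradient)
labels = torch.tensor([1, 3, 0, 7])
print(distillation_loss(student, teacher, labels))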

The development of high-performance computing is an important driving force for deep learning, especially for big pre-trained models. In the past, performance gains have mainly come from the growth of computing power. In the future, we need to devote more effort to improving the utilization of computing power. On the one hand, this lowers both the bar for anyone interested in AI to use big models and the carbon footprint of computing them. On the other hand, in the post-Moore era, there is limited room for further growth in raw computing power, and new methods should shift from relying on that growth to improving efficiency.

14.2.4 P4: Effective and Efficient Adaptation

Before the arrival of the era of PTMs, empirical improvements in NLP applications were primarily achieved by considerations across models, algorithms, task-specific characteristics, etc. After PTMs took the stage, researchers found that prominent advancements in almost all NLP tasks could be delivered by merely scaling up PTMs. Such success of scaling, despite being elusive, has fueled a surge in the development of big models with billions [93] and even hundreds of billions of parameters [13]. Accordingly, the emergence of big models triggers thought-provoking explorations of advanced model adaptation, which suggest that the full-parameter fine-tuning approach used for early PTMs is not the optimal solution for model adaptation. It is neither effective across all forms of datasets nor economically efficient on common computation devices. That is to say, the inherent characteristics of big models themselves must be taken into account, and innovative strategies for model adaptation should be established. To this end, how to effectively and efficiently adapt big models becomes a pivotal research issue. In this subsection, we discuss three facets of this problem: computationally practical adaptation, task-wise effective adaptation, and advanced adaptation with complex reasoning.

Computationally Practical Adaptation

The huge size of big models is a blessing in terms of experimental performance, whereas it is a curse in terms of the adaptation process. Deploying and adapting these models to assorted tasks require considerable computational and storage resources that are prohibitive to common researchers. Instead of updating all the parameters of big models, recent studies of delta tuning [30, 49, 50, 75] find that tuning only a tiny portion of parameters can yield performance comparable to, or even better than, full-parameter fine-tuning. These trainable parameters can take different structures or positions in big models, but a consistent empirical finding is that the larger the model, the better this paradigm performs. Delta tuning reifies conceptual capabilities to solve particular tasks in a concrete and lightweight manner. The resulting lightweight delta objects are easy to store and share across tasks and users, granting considerable maneuverability to big models and unleashing the imagination of the industrialized use of these behemoths. Despite the efficiency, there are dark clouds still hanging over this topic. For example, it is difficult to assess the optimal amount of tunable parameters for different tasks, and the convergence of delta tuning is relatively slower than that of full-parameter fine-tuning. In addition, the theoretical principles behind the success of delta tuning could also help the community further understand big models. The revolution in model adaptation does not only occur at the parameter optimization level but also at the level of data and tasks. Next, we take prompt learning as a landing point to discuss the task-wise effective adaptation of big models.
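
To illustrate the flavor of delta tuning, the following PyTorch sketch freezes a pre-trained linear layer and adds a small trainable low-rank delta in the spirit of LoRA-style methods; the dimensions, rank, and zero-initialization choice are illustrative assumptions, and actual delta tuning methods place their trainable modules in various structures and positions.

import torch
import torch.nn as nn

class LowRankDeltaLinear(nn.Module):
    def __init__(self, frozen_linear: nn.Linear, rank=8):
        super().__init__()
        self.frozen = frozen_linear
        for p in self.frozen.parameters():          # the big model's weights stay fixed
            p.requires_grad_(False)
        in_dim, out_dim = frozen_linear.in_features, frozen_linear.out_features
        self.delta_down = nn.Linear(in_dim, rank, bias=False)   # trainable
        self.delta_up = nn.Linear(rank, out_dim, bias=False)    # trainable
        nn.init.zeros_(self.delta_up.weight)        # start as an exact copy of the frozen layer

    def forward(self, x):
        return self.frozen(x) + self.delta_up(self.delta_down(x))

layer = LowRankDeltaLinear(nn.Linear(1024, 1024), rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable fraction: {trainable / total:.2%}")   # ~1.5% of all parameters for these sizes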

Task-Wise Effective Adaptation

Taking BERT [29] as an example, PTMs in the early stage first produce representations for the current inputs and adopt extra classifiers to carry out adaptation to downstream tasks. This seemingly established approach may actually be counter-intuitive since there is a considerable chasm between pre-training and adaptation. Empirical evidence shows that inserting additional contexts, i.e., prompts, and transforming downstream tasks into pre-training tasks can substantially shrink this gap and yield promising performance, especially in low-data regimes. Prompts can be generated and constructed in different means and forms, but fundamentally, this technique implies a trend toward the unification of NLP tasks, which includes the unification of pre-training tasks and downstream tasks, as well as the unification of different downstream tasks. Prompt learning has shown intriguing attributes such as zero- and few-shot learning, task generalization, and the structural unification of datasets. Besides, the flexibility of prompts makes it possible to smooth the logic chain of big models and stimulate complex reasoning capabilities.
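
A tiny Python sketch of this idea is shown below: a sentiment classification input is wrapped in a cloze-style template so that the downstream task looks like masked language modeling; the template and the verbalizer mapping label words to classes are made-up illustrations.

TEMPLATE = "{text} It was [MASK]."
VERBALIZER = {"great": "positive", "terrible": "negative"}   # label word -> class

def wrap_with_prompt(text):
    """Turn a raw input into a masked-language-modeling query."""
    return TEMPLATE.format(text=text)

query = wrap_with_prompt("The plot was predictable and the acting was wooden.")
print(query)
# A masked language model scores candidate words at the [MASK] position; whichever
# label word in the verbalizer receives higher probability determines the predicted class.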

Advanced Adaptation with Complex Reasoning

The reasoning capability of big models has been a long-standing debate that no one can perfectly arbitrate, where its existence, representation, and stimulation methods have remained open research questions for years. Intuitively, for human beings, solving more complex questions is almost equivalent to having more comprehensive reasoning ability. When it comes to big models or, more generally, neural networks, continuing studies on shortcut learning and the record-breaking performance on complex tasks create a confrontational situation. With no intention of philosophizing the argument, we look at this only from the perspective of performing complex tasks, where big models can produce striking logical processes in numerical and commonsense reasoning tasks [130]. Consistent with the aforementioned points on computationally practical and task-wise effective adaptation, such reasoning capabilities emerge at a certain point of model scaling, which implies that models should have sufficient capacity and be pre-trained on sufficient data to elicit complex reasoning. However, such reasoning abilities on complex tasks are not stable in practice: they show large variance across different data and are extremely demanding in terms of how they are stimulated. This puts researchers in the awkward position that we are all vaguely aware of the enormous potential of big models but have few clues about how to hit that upper limit.

In summary, research considerations of big model adaptations could be encapsulated in three points according to the above statements: First, big models should be computationally practical so that they can fully replace previous approaches when their training and storage are no longer an unattainable goal for the community. Delta tuning is a highly prospective attempt at the algorithmic level, and perhaps the community also needs to make efforts on computational systems and hardware. Second, the predictive power of big models could be realized by new types of data and task organization, and prompt learning is the product brought by the development of big models, which also pushes us to adopt a more unified perspective when looking at the tasks. Finally, to further tap the potential of big models, complex reasoning must be explored, and this is a key step for artificial intelligence to enter the cognitive level instead of making simple predictions.

14.2.5 P5: Controllable Generation with Pre-trained Models

Generating data from complex distributions is a long-standing challenge for the machine learning community due to the inherent high dimensionality and intractability. Fortunately, the unprecedented capabilities accompanying PTMs have brought this goal within reach and thus sparked a new surge of research. In empirical inspections of large-scale PTMs, researchers have discovered their impressive ability to generate high-quality text [13], images [94], videos [108], and programming code [19]. However, PTMs are black boxes, which makes us passively accept the generated results rather than actively control the models to produce content that matches a specific requirement. How to precisely introduce conditional constraints to control the generated results poses a major challenge for PTMs. Specifically, the challenge of controllable generation comes from three facets: a unified framework for diverse controls, the compositionality of controls, and a well-recognized evaluation benchmark.

A Unified Framework for Diverse Controls

The primary objective of controllable generation is to meet the diverse practical desires of users concerning content, features, and styles. Diverse controls result in dispersed research efforts. For example, depending on the category of the input, separate models are trained for generation from paragraphs [36], dialogues [145], tables [109], etc. Regarding the properties of the generated text, requirements for sentiment orientation [51] or keyword satisfaction [147] are accomplished by distributional change or insertion-based methods, respectively. In spite of the proliferation of works on diverse controls, we would prefer a unified framework that accomplishes all these controls rather than designing a specific method for each requirement. A unified framework can not only encourage research to iterate rapidly and convergently but also enable the investigation of the relatedness and combinatoriality of diverse controls. Recently, there have been several research efforts in this direction: (1) Prompt-based methods. By injecting either a control code [58] or continuous parameters [75], we can leverage the same PTM with diverse controls. The major drawback is that prompt-based methods usually have coarser control granularity or weaker control power and are thus incapable of handling hard-constraint tasks such as copying a span of text. (2) Distribution modification methods. By incorporating different constraints in the decoding stage of the language model [78], the generated text from the same PTM can be steered in different directions. Their limitation is that distribution modification methods may hinder the fluency of generation [59]. Hence, how to combine the two approaches or develop novel approaches for unification remains an open question.
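
The distribution modification idea can be sketched in a few lines of PyTorch: at each decoding step, the frozen language model’s next-token distribution is reweighted by an attribute score before sampling. The vocabulary size, the hand-made bonus for a single “topic” token, and the steering strength below are toy stand-ins for a real attribute model.

import torch
import torch.nn.functional as F

def steer_next_token(lm_logits, attribute_bonus, strength=2.0):
    """Combine the LM's logits with an attribute score before sampling the next token."""
    steered = lm_logits + strength * attribute_bonus
    return F.softmax(steered, dim=-1)

vocab_size = 10
lm_logits = torch.randn(vocab_size)        # what the frozen PTM predicts at this step
attribute_bonus = torch.zeros(vocab_size)
attribute_bonus[3] = 1.0                   # pretend token 3 matches the desired topic
probs = steer_next_token(lm_logits, attribute_bonus)
next_token = torch.multinomial(probs, num_samples=1)
print(probs[3], next_token)                # token 3 is now much more likely to be sampled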

Compositionality of Controls

In addition to diversity, controllable generation is also expected to be multidimensional and multi-grained to allow more intricate combinations of controls. As discussed in Chap. 3, compositionality, which studies how low-level linguistic units form high-level semantics, is a topic of considerable interest in text representation [86] and natural language understanding. It is less explored in the context of controllable generation due to the dispersal of control approaches. To this end, advocating a unified framework for generation can also contribute to compositionality. To steer generation toward multiple control requirements simultaneously, combining prompts with individual functionalities can be explored to form more comprehensive capabilities [92]. Nonetheless, the exploration is still primitive, with the simple concatenation of prompts as the composition method. As yet, we do not have an understanding of the internal mechanism of controllable generation for PTMs, making it difficult to develop advanced compositional control methods. Of course, we also look forward to other novel approaches that can achieve the compositionality of controls.

Well-Recognized Evaluation Benchmark

As ImageNet [28] in computer vision and the GLUE benchmark [120] in natural language understanding have demonstrated, a recognized benchmark can foster benign competition among researchers and identify promising approaches. However, such a benchmark is absent for generation tasks, especially controllable generation. The problem is further compounded by the fact that researchers may use different assessment methods and different data even when focusing on the same aspects of controllability [64, 78]. We highlight three difficulties in establishing a benchmark for controllable generation and the potential improvements. (1) First, human language is rich in expression, and the same meaning can take on many nuances, so no single golden answer is sufficient. A possible solution is to create semantic matches between utterances. This requires a powerful semantic understanding model that can provide reliable matching scores from diverse angles. Previous works, e.g., BERT-Score [146], are still insufficient in this regard. Whether large PTMs like GPT-3 could be used to provide powerful semantic matching is still an open problem. (2) Second, control requirements are intractable and diverse. For example, topic satisfaction or emotional tendencies are difficult to measure quantitatively. Considering this diversity, how to integrate the criteria into a unified implementation that can be used across the community is a complicated but urgent task. (3) Third, evaluation should take into account potentially degraded factors such as quality and efficiency. Some works [78] point out that there is an inevitable trade-off between the control satisfaction rate and text quality. Additionally, either increasing the length of the input via prompts or applying complex decoding strategies will sacrifice generation efficiency, which should also be taken into consideration for a well-rounded evaluation. Due to the aforementioned challenges, few attempts have been made to unify the evaluation, and a universally recognized benchmark is still urgently needed.

Controllable generation is important in all areas of AI. The approaches to controllable generation are not unified across tasks, which in turn leads to difficulties in composing various control approaches. Further, the challenge of controllable generation is exacerbated by the lack of a well-recognized evaluation benchmark. Advances in the above three directions will greatly contribute to the controllability of generation and thus make generation techniques better serve practical needs.

14.2.6 P6: Safe and Ethical Big Models

With the exciting progress made in recent years, big models are deemed cornerstones of modern NLP as well as AI. However, responsible AI research calls for clear recognition of both benefits and risks. While the benefits of big models are under extensive exploration, we should also be concerned about their potential negative impacts and harms to individuals and society before deploying them in the real world. In Chap. 8, we have discussed the robustness requirements for NLP models, and most of those topics are related to model safety or ethics. Although considerable efforts have been devoted to these issues, major challenges remain to be solved and future directions remain to be explored. In this subsection, we discuss open problems toward safe and ethical big models from the perspectives of evaluation, governance, and construction.

Evaluating Safety and Ethical Levels

The very first challenge in building safe and ethical big models is how to conduct rigorous and comprehensive evaluations. For model safety, we have introduced several essential threats against NLP models in Chap. 8, including backdoor attacks, adversarial attacks, and distribution shifts. However, a golden standard of model safety has not been reached, which means we still lack comprehensive safety evaluations. As deployed models are continually exposed to complex external environments, risks keep emerging, and we wonder whether the models are robust to them. Tramer et al. [117] point out that the majority of adversarial defense methods fail to work when attackers adapt their attack strategies accordingly. This suggests that safety against known threats is not enough, and underlying unknown threats should also be taken into consideration.

Measuring the ethical level of models is even more complicated. It has been observed that big models can generate stereotypical or hateful comments about certain groups of people [131], disseminate false or misleading information [144], and leak private information from training data [14]. Obviously, these behaviors violate human values and thus are undesirable. However, while it is easy to find individual cases, it is rather difficult to conduct rigorous measurements since human values are hard to specify. Given social and regional diversity, there does not exist a static and universal rule to assess ethical levels. Worse still, values about politics, religion, and ethnicity often conflict across groups, making evaluation even harder. Under such conditions, datasets and benchmarks in this research field need to be carefully checked for valid measurement. We also suggest that researchers cooperate with sociologists to gain theoretical insights.

Governing Big Models

Given the potential safety and ethical risks of big models, how to correctly cooperate with big models is an essential problem for the AI community, which is referred to as model governance. However, big model governance is challenging both technically and non-technically. On the technical side, big models are capable of completing various downstream tasks via simple adaptation, including harmful ones such as generating offensive speech or fake news. Due to the black-box nature of big models, finding and disabling these harmful functionalities can be difficult. Although practitioners adopt some effective approaches like keyword filters, these cannot guarantee that the models are fully governed [119], leaving this problem open for future research. On the non-technical side, model governance is not only about the research community but also about establishing principles and laws across model providers and users, which requires multi-party cooperation. We are glad to see that some responsible organizations are contributing in this area [24], and we appeal to more researchers to help advance this important direction.

Building Inherently Safe Models

Another fundamental question about model safety is how we can learn inherently safe models. In Chap. 8, we introduce approaches to robustness issues, but most of the methods mentioned there are targeted at specific problems, except pre-training. However, while it has been widely acknowledged that bigger models may make fewer mistakes, we still argue that scaling model and data sizes is not the elixir that eliminates safety problems, because an inherently safe model does not equal a model making no mistakes. Instead, to achieve human-level robustness, models should (1) know what they know and do not know (i.e., be calibrated) and (2) learn from mistakes and correct themselves [69, 83]. In this regard, current big models are still far from inherently safe, and we hope to see more efforts devoted to this fundamental problem. Toward inherently safe models, we identify two possible directions. (1) Incorporating knowledge. In Chap. 9, we see the remarkable success achieved by injecting knowledge into PTMs. For model safety, incorporating knowledge can help as well. For example, models will not be fooled by “U r stupid!” if they possess phonetic knowledge. Hence, we recognize building knowledgeable big models as a reliable approach to model safety. (2) Cognitive learning. The current learning paradigm for big models is still data-driven, which cannot fully reflect the underlying risks in the real world. Different from models, we human beings can actively interact with the world and consistently gain knowledge. Moreover, we also largely benefit from the “trial and error” process and learn how to avoid mistakes. Therefore, we stress the importance of learning from cognition and interaction for building safe models [65], and we further elaborate on this topic in Sect. 14.2.8.

Safety and ethics are two long-standing topics in AI and have even been extensively discussed in literature and artworks (e.g., Isaac Asimov’s “Three Laws of Robotics” [5]). Out of concern about runaway powerful machines, we present several key challenges and future directions for this open problem. We stress that, in the context of today’s AI hype, we researchers especially need to consider every single step carefully and take responsibility for the healthy development of big models.

14.2.7 P7: Cross-Modal Computation

Building intelligent agents that can think and behave like humans is a long-standing goal of AI. An important and appealing characteristic of human intelligence is the impressive capability of perceiving and handling information from different modalities. Recently PTMs have greatly pushed forward the development of intelligent agents in single modalities (such as text [29], image [44], and audio [31]) and also led to breakthroughs in cross-modal computation. By exploiting self-supervised signals in large-scale cross-modal data, generic representations connecting different modalities can be effectively pre-trained and transferred to facilitate various downstream tasks. Cross-modal PTMs based on the pre-training-fine-tuning paradigm seem to constitute a promising foundation to realize such cross-modal intelligence. To this end, we discuss several promising directions for advancing cross-modal PTMs in this subsection, including big cross-modal models with efficient pre-training and adaptation, more unified representation with more modalities, and embodied cross-modal reasoning and cognition.

Big Cross-Modal Models with Efficient Pre-training and Adaptation

Existing works show that impressive capabilities can emerge in pre-trained language models when the model capacity (e.g., the number of parameters) substantially scales up. For example, the 175B-parameter GPT-3 is able to perform in-context few-shot learning and chain-of-thought prompting for complex tasks. However, although cross-modal pre-training on deep Transformers has pushed forward the state of the art on various tasks, cross-modal models are typically limited in parameter size compared with language models. This hinders the exploration of more advanced capabilities and tasks for cross-modal PTMs. An important reason is that, compared with big language models, it can be even more expensive to pre-train and adapt big models that deal with multiple modalities. Some works have explored more efficient pre-training by reusing well-pre-trained unimodal models and focusing on connecting PTMs from different modalities [3]. Some works have investigated the efficient adaptation of vision-language models in terms of both data [3, 126, 139] and parameters [150]. In the future, more efforts can be devoted to the efficient pre-training and adaptation of big cross-modal representation learning models.

More Unified Representation with More Modalities

Traditional cross-modal works typically design highly specialized model architectures to maximally exploit the inductive bias of modalities and tasks. For example, RNNs are designed to model the sequential dependency of text, and CNNs are developed to model the shift and scale invariance of images. The learning signals usually come from the human annotation of specific tasks. However, designing specific model architectures and learning signals for different modalities and tasks requires extensive expert knowledge, and it can be problematic to maintain a separate model for each of a large number of tasks. With the development of deep cross-modal pre-training, cross-modal representation learning models are becoming more unified in terms of model architectures and learning mechanisms [74, 138]. Most recently, some works have shown promising results in using unified model architectures, parameters, and learning mechanisms for unimodal, cross-modal, and embodied tasks [97, 123, 124]. Some works have explored pre-training with more modalities, including text, image, and audio [79]. In the future, building a unified representation learning model that can simultaneously deal with various modalities and tasks will be a promising foundation and path toward realizing general intelligent systems.

Embodied Cross-Modal Reasoning and Cognition

Semantic recognition capability has been extensively investigated for different modalities, e.g., named entity detection from text and object detection from images. For more complex reasoning and cognition capabilities, obstacles have been encountered in different ways: (1) For modalities with low information density, such as images and audio, semantic recognition can already be a challenging task [98], let alone more complex reasoning [143]. (2) For text, which has high information density, it is more natural to perform complex reasoning based on abstract symbolic tokens, and recently big language models have shown promising results in commonsense and mathematical reasoning [130]. However, many AI researchers believe that true cognition capability cannot arise from learning only from text [8]. Research in cognitive science also shows that the human mind is highly shaped by embodied learning [133]. Therefore, a more promising direction will be an embodied cross-modal reasoning model, where concrete signals from other modalities are effectively aggregated into a text-based central unit for high-level semantic reasoning. Some attempts have been made [10], and we believe the direction is worth more exploration.

In summary, as an important interdisciplinary area that connects information in different modalities, cross-modal computation is essential and beneficial to various real-world AI applications and is also one of the key problems on the path to more general intelligent systems. With their recent rapid development, cross-modal PTMs have become a new foundation for advancing toward this goal. We believe that developing an efficient big cross-modal PTM that can deal with various complex embodied reasoning tasks in a unified fashion will be a promising direction.

14.2.8 P8: Cognitive Learning

An essential measure of general AI is whether neural models can correctly perceive, understand, and interact with the world, i.e., the cognitive ability. A prototype of general intelligence can be viewed as the capability of manipulating existing tools (e.g., search engines, databases, web-based email systems, etc.), conducting cognitive planning with complex reasoning, and interacting with the real world to acquire and organize information and knowledge.

Serving as the foundation for AI, PTMs have pushed state-of-the-art performance on a variety of downstream tasks. The rich language knowledge, world knowledge, and commonsense knowledge stored in PTMs determine their unique advantages in cognitive modeling. Efficiently utilizing such knowledge contributes to stimulating the cognitive ability of PTMs, based on which PTMs could effectively interact with the real world in complex scenarios. Despite the great success, current PTMs still cannot handle advanced cognitive tasks. To bring PTMs to human-level cognitive intelligence, we identify three core challenges for achieving general cognitive intelligence:

Understanding Human Instructions and Interacting with Tools

How could PTMs better understand users’ instructions and interact with existing tools to complete a specific task? Fulfilling this goal requires precisely (1) mapping natural language instructions in the semantic space to the cognitive space of the model and (2) mapping the cognitive ability of the model to the action space of the tool, so as to correctly perform the operation and use the tool. The realization of this goal has profound practical significance: (1) for one thing, an ideal next generation of human-computer interaction (HCI) will be based on natural language rather than a graphical user interface (GUI). The user only needs to inform the model of the goals that need to be achieved, and the model can perform a series of operations in response; (2) for another, the bar for utilizing complex tools will be greatly lowered. In this sense, any beginner can quickly get started with new software or tools with the help of the model, making it more convenient to fulfill an intended complex task. However, PTMs trained on general domains are not designed for instruction understanding or tool manipulation by nature. To this end, a potential solution is continual pre-training, which adapts the PTM from the original pre-training domain to the human instruction domain, so as to better grasp the semantics of human instructions. In addition, it is also promising to design knowledge-enhanced tuning methods to improve PTMs’ semantic understanding of specific domains under the guidance of structured human knowledge.

Cognitive Planning and Reasoning for Complex Tasks

Based on a proper understanding of human instructions, PTMs could form implicit solution chains, i.e., thoughts, for complex tasks. This process requires the ability to reason and plan for complex tasks. Such an ability has a variety of applications, including theorem proving [68], tool manipulation [137], etc. The recently emerged chain-of-thought (COT) prompting techniques [130] can be leveraged to further enhance PTMs’ reasoning ability. Through a sequence of intermediate natural language reasoning steps, COT prompting helps PTMs decompose a complex task into relatively simple atomic tasks and solve them one by one. Ultimately, the correct decision-making path can be found to achieve the user’s goal. Another potential solution for complex reasoning is to “learn from experiences,” that is, to generalize the reasoning process of a specific task to form “thoughts” for planning on other tasks. To achieve this goal, we need to train models to understand how different tasks are intrinsically related, so as to break the barriers between different tasks. In this way, models can learn various tools by analogy. Such a capability is related to a concept in cognitive psychology: human beings generalize a property from one stimulus to another if the two are similar in an appropriate psychological space [103].
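
The spirit of COT prompting can be seen in the following toy Python snippet, which builds a prompt whose demonstration spells out its intermediate reasoning steps; the arithmetic word problems are made up for illustration, and the string would be handed to a sufficiently large PTM for completion.

cot_prompt = """Q: A library has 120 books and buys 3 boxes of 15 books each. How many books does it have now?
A: The 3 boxes contain 3 * 15 = 45 books. 120 + 45 = 165. The answer is 165.

Q: A train travels 60 km per hour for 2 hours and then 40 km per hour for 3 hours. How far does it travel?
A:"""

print(cot_prompt)
# Ideally, the model's continuation spells out the intermediate steps
# (60 * 2 = 120, 40 * 3 = 120, 120 + 120 = 240) before giving the final answer 240.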

Integrating Information from the Real World

By interacting with the real world, we may gather a series of fragmented pieces of information. It is of great importance for PTMs to integrate the information returned by existing tools into a self-contained and well-organized whole. Rendering such organized information to humans completes a closed loop for a cognitive task. Integrating information is challenging for PTMs because newly retrieved information may contradict the original knowledge or beliefs of the PTMs themselves, and it is under-explored how to combine the implicit knowledge of PTMs with knowledge retrieved from the real world. In fact, recent efforts have been made to address this challenge. For instance, in open-domain QA, WebGPT [89] and GopherCite [85] are proposed to leverage externally retrieved knowledge to increase the reliability, faithfulness, factuality, and interpretability of the outputs produced by PTMs. Specifically, researchers teach PTMs to interact with reliable IR systems like Microsoft Bing and Google Search, so that the system can retrieve more faithful and relevant documents. After that, PTMs are trained to organize the supporting facts into a coherent and self-contained answer. Although many efforts have been devoted to integrating textual information from the real world, less is studied about other types of information (e.g., graphical information, tabular information, etc.).

To sum up, the ultimate goal of cognitive learning is to move toward the next generation of machine intelligence. Cognitive intelligence will enable PTMs to play a more involved role in all walks of life and to interact with the real world on behalf of humans, which will have a huge impact on both academia and industry.

14.2.9 P9: Innovative Applications of Big Models

AI is a discipline that emphasizes practical applications and is widely expected to play a role in a broad range of downstream fields and task scenarios. Many of these applications, such as autonomous driving [110] and medical assistance [35], present both immense value and great challenges. Traditional solutions for AI applications follow two main ideas. The first is to build symbolic systems driven by human knowledge (like the expert systems of the 1980s), but it is difficult for manually crafted rules to cover all the scenarios encountered in practice. The other is to build data-driven deep learning systems, which still face obstacles in fields that lack sufficient high-quality training data due to high labeling costs.

The emergence of big models has brought new possibilities for innovative applications. Big models are equipped with a substantial amount of human knowledge, which is scattered in large-scale unlabeled corpora and can be acquired in an unsupervised manner, avoiding high annotation costs. Representative big model applications can be classified into two types: new breakthroughs and new scenarios.

New Breakthroughs

This type refers to big model systems that achieve surprisingly good performance on long-standing application problems. For example, the Critical Assessment of protein Structure Prediction (CASP) challenge has been held for over 20 years, and machine learning systems made only slow progress on this task until the appearance of AlphaFold [57], as we have introduced in Chap. 12. Image generation is also a classical task, yet DALLE-2 [94] achieves, for the first time, high-resolution generation that precisely expresses the meaning of the given text, producing results so realistic that humans can hardly tell whether they are real. Further, DALLE-2 can imitate paintings of a particular style or even create things that have never been seen in the real world. This greatly inspires and expands the boundaries of artistic creation and has spurred a new wave of AI-generated content (AIGC).

New Scenarios

This type refers to problems that are newly proposed or solved for the first time by AI methods. For instance, the characterization of COVID-19 has been a new and significant research topic in recent years, and big models have been applied to precision diagnostics, drug repurposing, epidemiological spread forecasting, and other problems [105]. By contrast, ancient writing research is an old topic in which AI never played a central role until DeepMind proposed Ithaca [6], a model designed for ancient Greek inscriptions. Ithaca can perform textual restoration, geographical attribution, and chronological attribution; it helps historians improve their accuracy from 25% to 72% and provides evidence for the study of history and civilization.

In the above examples, the growth of parameter scale brings greater knowledge capacity and generalization across various domains, which leads to a leap in performance. Based on these success cases, we propose the following two prerequisites for an application scenario to benefit from big models: plenty of domain data and documented domain knowledge.

Plenty of Domain Data

Big models need more data for training (e.g., 650M training images for DALLE-2). Fortunately, the requirements on data form are much looser: unlabeled and heterogeneous data can be well utilized by big models. Most of these models follow the basic pre-training-fine-tuning paradigm and can use large-scale unlabeled data to learn a general understanding of basic elements (e.g., words for a language, pixels for an image) by themselves. From there, it is relatively easy to transfer to a specific downstream domain and solve its tasks with as little supervision as possible. For instance, recent works have explored the necessity and advantage of adopting models pre-trained on natural images for medical image processing, especially when the downstream dataset is small [111]. Besides, researchers also explore large-scale PTMs for domains with versatile data formats, such as the collaborative processing of chemical and natural language as we have introduced in Chap. 12. Nevertheless, even for a newly created scenario, corresponding domain data is still required to unleash the potential of big models.
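
The following PyTorch sketch illustrates, under simplified assumptions, the pre-training-fine-tuning transfer described above: a stand-in encoder (which would be loaded from a pre-trained checkpoint in practice) is kept fixed, and only a small task head is trained on a tiny labeled domain dataset.

```python
# Minimal sketch of the pre-training-fine-tuning paradigm: a (stand-in)
# pre-trained encoder is reused and only a small task head is trained on a
# small labeled domain dataset. In practice the encoder would be loaded from a
# large-scale pre-trained checkpoint rather than built here.
import torch
from torch import nn

encoder = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 256))
# Imagine: encoder.load_state_dict(torch.load("pretrained_encoder.pt"))
for p in encoder.parameters():
    p.requires_grad = False           # keep general pre-trained knowledge fixed

head = nn.Linear(256, 2)              # small task-specific classifier
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# A tiny labeled domain dataset (random stand-in for, e.g., medical features).
x, y = torch.randn(32, 128), torch.randint(0, 2, (32,))

for _ in range(10):                   # a few steps of supervised fine-tuning
    logits = head(encoder(x))
    loss = loss_fn(logits, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
print(f"fine-tuning loss: {loss.item():.4f}")
```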

Documented Domain Knowledge

For fields in which humans already have a basic understanding, the architecture and training strategy of big models can be carefully designed based on the corresponding prior knowledge, and documented knowledge also provides the basic conditions for big models to access and utilize knowledge. In previous chapters (e.g., Chaps. 9 and 11), we have explained how to conduct knowledge-guided representation learning, such as architecture reformulation and input augmentation methods. In addition, big models have been shown to have behavioral imitation capabilities that allow them to access knowledge as human beings do. A typical example is WebGPT [89], which can automatically search for commonsense and factual knowledge to generate more reasonable answers, as we have introduced in cognitive learning. From these examples, we can see that scenarios with existing domain knowledge bases or ontologies offer more favorable conditions for realizing innovative applications.
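
As one simple way to exploit documented domain knowledge, the sketch below illustrates input augmentation: knowledge-base triples about the entities mentioned in the input are appended as extra context for the model. The toy knowledge base and the string-matching lookup are hypothetical simplifications for illustration.

```python
# Sketch of knowledge-guided input augmentation (hypothetical knowledge base):
# documented domain knowledge, stored as triples, is retrieved for the entities
# in the input and appended as extra context for the model.

KNOWLEDGE_BASE = {
    "aspirin": [("aspirin", "treats", "headache"),
                ("aspirin", "interacts_with", "warfarin")],
}

def augment(text: str) -> str:
    facts = []
    for entity, triples in KNOWLEDGE_BASE.items():
        if entity in text.lower():
            facts += [f"{h} {r} {t}" for h, r, t in triples]
    if not facts:
        return text
    # The model sees the original input plus the documented knowledge.
    return text + "\nKnown facts: " + "; ".join(facts)

print(augment("Can a patient on warfarin take aspirin?"))
```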

Spreading the wings of imagination, we can see many fields in which big models can engage, from sophisticated scientific prediction (such as on weather data) to smart home services in our daily life. More innovative applications are waiting for us to explore.

14.2.10 P10: Big Model Systems Accessible to Users

Due to the generalizability of pre-trained models in terms of architecture and capability, big models are expected to become a foundational infrastructure for many information services supported by NLP and AI [9], e.g., search engines, personalized recommendation, and virtual assistants, as well as for domain-specific information organization in, e.g., the financial, medical, legal, and academic domains.

In particular, recent findings on parameter-efficient delta tuning [30] show that, by keeping a central big model fixed, we can simply design task-specific delta objects to adapt the central model to handle multiple downstream tasks. These breakthroughs indicate a new technical paradigm in NLP: from training a task-specific model for each task separately to stimulating task-specific knowledge scattered in a unified and versatile big model. Intuitively, with pre-trained big models, our focus is no longer limited to how to learn model parameters for specific tasks but extends to how to stimulate the knowledge of big models to handle specific tasks.
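
The following PyTorch sketch illustrates the spirit of delta tuning with a LoRA-style low-rank delta object: the central model’s weights stay frozen, and only a small number of additional parameters are trained per task. The DeltaLinear module and its hyperparameters are illustrative assumptions, not a specific published implementation.

```python
# Minimal sketch of parameter-efficient delta tuning: the central model's
# weight stays frozen, and only a small low-rank delta (A, B) is trained per
# task, in the spirit of LoRA-style delta objects.
import torch
from torch import nn

class DeltaLinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                  # central model is fixed
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, rank))

    def forward(self, x):
        # Frozen base transformation plus the trainable low-rank update.
        return self.base(x) + x @ self.A.t() @ self.B.t()

layer = DeltaLinear(nn.Linear(768, 768))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable delta parameters: {trainable}/{total} ({100*trainable/total:.1f}%)")
```

In this toy configuration, the trainable delta amounts to roughly 1% of the layer’s parameters, while the frozen base weights can be shared across all tasks.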

Although the trend of building unified big models for multiple tasks is clear, it is still not easy for most institutions and individuals to enjoy the power of big models due to the computation and expertise barriers, as we have discussed in Chap. 13. We argue that, just as database management systems (DBMS) were proposed to manage massive data and big data analytics systems (BDS) were proposed for big data mining, it is time for us to build unified management systems for big models, i.e., big model systems (BMS). Similar to DBMS and BDS, which store and analyze data in a unified view, we should design BMS to build and organize big models in a unified view. BMS is expected to provide easy and standardized interfaces for the deployment and application of big models. We should consider the following principles to design BMS accessible to general institutions and individuals.

Data Form and Operation Abstraction of Big Models

Both data form abstraction and operation abstraction enable DBMS and BDS to serve as standard infrastructure in most companies and organizations. Examples of data form abstraction are the tables in relational DBMS (RDBMS), supported by the relational model of data [23], and the resilient distributed datasets (RDD) in the Spark BDS [141]. Examples of operation abstraction are the structured query language (SQL) in RDBMS and the map and reduce functions in the MapReduce BDS [27]. Intuitively, these abstractions decouple the users and developers of DBMS and BDS. Take DBMS as an example: users only consider how to manage data through a series of unified interfaces, without learning how the underlying modules of DBMS perform data management; developers, by ensuring that the interfaces provided to users remain unchanged, have more freedom to develop and optimize the underlying modules of DBMS.

We believe big models will also serve as an infrastructure for information services, beyond DBMS and BDS. A general-purpose BMS is expected to enable more people with basic programming skills to use big models. Hence, we should have data form abstraction and operation abstraction specifically designed for big models. BMS relies on data form abstraction to support learning big models from various types of data and to provide a unified scheme for model manipulation. With the help of prompt learning as a natural language interface between humans and big models [55, 99], we can design high-level and unified programming languages for BMS to manipulate big models and free big model users from directly interacting with big models through sophisticated deep learning programming.
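
A possible shape of such an operation abstraction is sketched below: users describe what they want through a task name and named inputs, and the BMS compiles the request into a prompt for the underlying big model, hiding deep learning details. All class and method names here are hypothetical.

```python
# Sketch of an operation abstraction for a big model system (all names are
# hypothetical): users state the task declaratively, and the BMS translates
# the request into a prompt for the underlying big model.

class BigModelSystem:
    def __init__(self, model_name: str):
        self.model_name = model_name   # a real BMS would load/route the model here

    def _generate(self, prompt: str) -> str:
        # Placeholder for the underlying big model's generation call.
        return f"<output of {self.model_name} for prompt: {prompt[:40]}...>"

    def run(self, task: str, **inputs) -> str:
        # The "query language": a task name plus named inputs, compiled into a prompt.
        prompt = f"Task: {task}\n" + "\n".join(f"{k}: {v}" for k, v in inputs.items())
        return self._generate(prompt)

bms = BigModelSystem("central-10B")
print(bms.run("summarize", document="Big models are becoming infrastructure ..."))
print(bms.run("translate", source="Bonjour le monde", target_language="English"))
```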

Efficient Computation and Management of Big Models

BMS should support the comprehensive management of big models based on many techniques from the above-mentioned topics, such as the high-performance computing mentioned in P3, the parameter-efficient delta tuning mentioned in P4, and the safety issues mentioned in P6. Since big model techniques are still developing rapidly, BMS will actively evolve in its physical implementation by taking advantage of these advances while keeping the user interface stable.

We further argue that, with the novel adaptation technique of delta tuning, BMS should manage and schedule central big models as well as massive task-specific delta objects to support a high concurrency of user requests. Hence, we need to design an efficient model scheduling manager (MSM) responsible for storing and distributing big models and delta objects on computing devices. There are many real-world scenarios that should be addressed by MSM, such as the continual learning and adaptation of big models, the efficient scheduling of multiple big models of various sizes and purposes, fault tolerance that recovers from hardware or network failures, and support for heterogeneous device architectures such as cloud-edge-terminal cooperation.
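
A toy sketch of such an MSM is given below, assuming hypothetical interfaces: a single central model stays resident in memory, while lightweight task-specific delta objects are loaded on demand, cached, and attached per request.

```python
# Toy sketch of a model scheduling manager (MSM): one central big model stays
# resident, while lightweight task-specific delta objects are loaded on demand
# and attached per request. All names here are illustrative.

class CentralModel:
    def infer(self, text: str, delta: dict) -> str:
        # Placeholder for inference with the frozen central model plus a delta.
        return f"<{delta['task']} output for: {text}>"

class ModelSchedulingManager:
    def __init__(self, model: CentralModel, cache_size: int = 2):
        self.model = model
        self.cache_size = cache_size
        self.delta_cache = {}                        # task name -> delta object

    def _load_delta(self, task: str) -> dict:
        if task not in self.delta_cache:
            if len(self.delta_cache) >= self.cache_size:
                self.delta_cache.pop(next(iter(self.delta_cache)))  # evict oldest
            # A real MSM would load delta weights from storage here.
            self.delta_cache[task] = {"task": task, "weights": f"delta-{task}.pt"}
        return self.delta_cache[task]

    def serve(self, task: str, text: str) -> str:
        return self.model.infer(text, self._load_delta(task))

msm = ModelSchedulingManager(CentralModel())
print(msm.serve("sentiment", "The movie was wonderful."))
print(msm.serve("summarization", "Big models are becoming infrastructure ..."))
```

The simple eviction rule here stands in for the more sophisticated scheduling, fault tolerance, and device placement policies a production MSM would require.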

In summary, we have shown the broad prospects of big models in the above-mentioned nine key problems, and we need big model systems to turn these prospects into a reality accessible to general institutions and individuals. OpenBMB, introduced in Chap. 13, can be regarded as our preliminary attempt at building big model systems. As discussed in this key problem, BMS brings many open problems concerning the deployment of big models in the real world, which require the collaboration of researchers and practitioners from deep learning and AI, high-performance computing, software engineering, networking, and edge/cloud computing. We believe an efficient and effective big model system will play an essential role in making the growing capabilities of AI accessible to everyone.

14.3 Summary

In this chapter, the final chapter of the book, we look ahead to the future of representation learning in 2023, standing on the shoulders of the new giants, big models. We list ten key problems of big models, including theoretical foundation, next-generation architecture, high-performance computing, parameter-efficient delta tuning, controllable generation, safety and ethics, cross-modality, cognitive learning, innovative applications, and big model systems.

Although the summarized problems may be biased by our own research experiences, we still hope they can help readers of the book find their interests. Any suggestions and comments from our community are welcome. Let’s work together on these exciting topics to contribute novel techniques and applications of AI in the future.