14.1 Pre-trained Models: New Era of Representation Learning

It has been about two years since the publication of this book’s first edition. These two years have witnessed the astonishing rise of large-scale pre-trained models, also known as big models or foundation models. With the development of pre-trained modeling techniques, representation learning is exhibiting the following remarkable trends.

Unified Architecture of Representation Learning

Ever since the initiative of parallel distributed processing (PDP) in the 1980s, hundreds of neural network architectures have been proposed to address various data, tasks, and domains, with landmark architectures such as Hopfield networks [48], Boltzmann machines [2], self-organizing maps (SOM) [63], recurrent neural networks (RNN), convolutional neural networks (CNN) [70], long short-term memory (LSTM) [47], ResNet [45], and Transformers [118].

As neural network techniques evolve, some architectures fade away and others emerge. At the early stage of the deep learning era, there were still many architectures specifically designed with different characteristics. For example, at that time, we had to conduct experiments to find which architectures were more suitable for a given NLP task, among CNNs, GRUs, LSTMs, and their variants. After the Transformer architecture was proposed in 2017, and especially after the pre-trained language models BERT and GPT were built with Transformers as the backbone, the optimal neural architectures across various domains and tasks have become more and more unified from formerly diverse schemes, as shown in Chap. 5. Transformers have become the most widely used backbone across almost all NLP tasks, ranging from natural language understanding to generation.

The unifying trend also extends across multiple modalities: Transformers have shown their power beyond NLP, in CV as shown in Chap. 7, and on other data such as biomedical structures as shown in Chap. 12. A unified architecture across multiple modalities will help model the rich knowledge of cross-modal interaction and further facilitate learning from heterogeneous data.

Of course, the unification process is not complete, and there is no evidence that Transformers will be the ultimate neural architecture for representation learning. This remains an important research topic for the future.

Unified Model Capability for Multiple Tasks

With the “pre-training-fine-tuning” pipeline, pre-trained language models also build unified model capability from large-scale unlabeled corpora for multiple downstream tasks. The unified capability of pre-trained models becomes more significant as model parameters grow to the billion scale with more data and more computation power.

The evidence is their unprecedented power in zero-shot and few-shot learning, as shown in Chap. 5. For example, with parameter-efficient delta tuning, we can adapt big models to specific complicated tasks using no more than 1% additional parameters. This makes us conjecture that big models may have learned all essential knowledge during pre-training on large-scale corpora, and that the function of delta tuning is only to inform big models which internal knowledge should be stimulated for specific downstream tasks.

The unified model capability revealed by big pre-trained models makes them completely different from conventional machine learning approaches including statistical learning and deep learning. It requires the exploration of a new theoretical foundation and efficient optimization methods conditioned on pre-trained big models.

Moreover, given the above-mentioned characteristics of unified model architecture and model capability, we believe pre-trained models to some extent indicate the maturity of distributed representation learning for AI, with great potential for extensive use in every area requiring AI assistance. This will open a new era of AI and NLP, from research to application. Standing on the shoulders of these new giants, big pre-trained models, there are also many new challenges and opportunities for representation learning. Here we summarize ten key open problems for pre-trained models and hope more efforts will be devoted to these problems, promoting wide application of big model techniques.

14.2 Ten Key Problems of Pre-trained Models

In this section, we summarize ten key open problems of pre-trained models, including theoretical foundation, next-generation architecture, high-performance computing, parameter-efficient delta tuning, controllable generation, safety and ethics, cross-modality, cognitive learning, innovative applications, and big model systems.

Note that these open problems are raised based on our research experience with pre-trained models and deep learning. This does not imply that problems beyond these ten are unimportant or less important.

14.2.1 P1: Theoretical Foundation of Pre-trained Models

As pre-trained models (PTMs) [9, 42] become the infrastructure of modern NLP, the theoretical principles behind them become exceedingly intriguing to the community. Self-contained and rigorous mathematical theories could effectively guide improvements to neural structures, pre-training objectives, and adaptations of PTMs and even pave the road to more powerful artificial intelligence. However, the sad truth is that we are still far from a complete understanding of PTMs. Their mechanism intersects with deep neural networks, transfer learning, and self-supervised learning in an intricate way, and moreover, considerable empirical evidence suggests that the potential of PTMs has not been fully explored.

The specialty of PTMs lies in the universal generalization capability they exhibit when adapted to various tasks. Constructed on the basis of deep neural networks (typically deep Transformers [118]), PTMs are first pre-trained on massive unsupervised corpora and then adapted to particular downstream tasks. After optimizing a general language modeling objective in the pre-training phase, PTMs are able to yield tremendous generalization capability on a wide range of NLP tasks that involve language data, even with a few examples and a small amount of optimization [30, 50, 71].

In this subsection, we hold the mindset of seekers and discuss the theoretical foundation of the miraculous generalization capability of PTMs by decomposing it into several sub-questions.

What Is the Appropriate Mathematical Description of the Generalization Capability?

When dealing with machine learning and deep learning models, calculus, linear algebra, and probability theory are among the most common tools, while more advanced (and complicated) mathematics remains almost untouched at the current stage. This may limit our understanding because the real linear and nonlinear operations in the representation space are difficult to describe with these tools. Some argue that the probability theory framework widely used to describe generative models becomes intractable when it comes to capturing the correlations of high-dimensional variables [96]. Under this circumstance, other mathematical tools need to be adopted and evaluated to interpret the utilities of neural networks and even PTMs [43, 127]. For example, recent progress in geometric deep learning [11] elaborates different types of neural networks through the lens of symmetry and invariance, bringing new inspiration to the community. There are also works that attempt to provide mathematical frameworks for the revolutionary trigger point, i.e., the Transformer model [34]. Nevertheless, merely attempting to elucidate the neural network architecture may still be insufficient to understand PTMs, and grasping the relationship between pre-training and adaptation is crucial as well.

Why Does Pre-training Bring the Generalization to Downstream Tasks?

Compared to traditional deep learning, the most obvious difference, and the key to success, is the extensive pre-training phase over massive data. The simplicity of the pre-training task and the effortlessness of the adaptation to complex tasks urge us to wonder how pre-training and adaptation are related in principle. From a vague point of view, PTMs’ colossal capacity makes it possible to induce a type of general knowledge, while adaptation is a process that exposes such knowledge [142]. This is, of course, an incomplete and unverifiable explanation, but a series of delta tuning [30] efforts implicitly guided by this insight has yielded remarkable results in a parameter-efficient manner. Taking a closer and simplified look, such knowledge can be modeled as coherence structures in a latent space under the Bayesian framework [128, 134]. Switching to another pragmatic perspective, analyzing the loss landscape may bring new insights into the relationship between pre-training and adaptation [73], where the pre-training phase produces a readily optimizable initialization landscape for PTMs surrounded by local optimums. Modern supervised learning theory aims to bound the theoretical adaptation loss via the empirical adaptation loss and generalization errors, and studies of self-supervised learning borrow from this progress to bound the adaptation loss with the pre-training loss under certain preconditions [4, 43, 116, 134]. Although the analysis of pre-training and adaptation could move our understanding of PTMs one step forward, the special capabilities that come with model scaling take the ultimate goal even further.
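
As a rough illustration of the kind of statement such theories pursue, a schematic and deliberately simplified bound of this flavor can be written as follows; the notation is ours for illustration and is not taken from any specific cited result:

\[ \mathcal{L}_{\text{task}}(\theta) \;\le\; \hat{\mathcal{L}}_{\text{task}}(\theta) \;+\; \mathcal{O}\!\left(\sqrt{\tfrac{\mathrm{Complexity}(\Theta)}{n}}\right), \qquad \hat{\mathcal{L}}_{\text{task}}(\theta_{\text{pre}}) \;\le\; f\bigl(\mathcal{L}_{\text{pre}}(\theta_{\text{pre}})\bigr) \text{ under suitable preconditions,} \]

where \(\mathcal{L}_{\text{task}}\) and \(\hat{\mathcal{L}}_{\text{task}}\) denote the population and empirical adaptation losses over \(n\) downstream examples, \(\mathrm{Complexity}(\Theta)\) measures the effective capacity of the hypothesis class, and \(f\) is a task-dependent function linking the pre-training loss \(\mathcal{L}_{\text{pre}}\) to the adaptation loss. The open question is precisely which forms of \(\mathrm{Complexity}\) and \(f\) actually hold for large PTMs.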

How Are the Model Capacity and Capabilities Related?

One of the most fascinating empirical observations of PTMs is the power expressed by merely scaling up their size. It is not just a matter of accuracy on standard classification or generation tasks; large PTMs counter-intuitively exhibit unprecedented emergent capabilities as the number of parameters increases. Models with tens of billions of parameters can give surprising adaptation performance with only a small number of trainable parameters prepended to the input layer [71]. GPT-3 [13], a model with 175 billion parameters, shows an extraordinary capability of in-context learning, which uses several examples to stimulate the model to imitatively make predictions without tuning a single parameter. Large-scale models can even directly learn from tokenized behaviors of humans to carry out complex tasks such as using search engines [89] and playing sandbox games [7]. Experimental studies indicate that special capabilities of large models do not accumulate linearly but emerge at a certain point [129]. Although such power of scale has been verified under different scenarios, it is still hardly framed theoretically.
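
To make the in-context learning setup concrete, the following minimal Python sketch builds a few-shot prompt for a made-up sentiment task; the task, the demonstrations, and the label words are illustrative assumptions, and the resulting string would simply be fed to a frozen PTM’s text-completion interface without any parameter update.

def build_icl_prompt(demonstrations, query):
    """Concatenate labeled demonstrations followed by the unlabeled query."""
    lines = []
    for text, label in demonstrations:
        lines.append(f"Review: {text}\nSentiment: {label}\n")
    lines.append(f"Review: {query}\nSentiment:")
    return "\n".join(lines)

demos = [
    ("The movie was a delight from start to finish.", "positive"),
    ("I walked out halfway through, it was that dull.", "negative"),
]
prompt = build_icl_prompt(demos, "A thoughtful and beautifully shot film.")
print(prompt)  # the model's continuation (e.g., " positive") is read off as the prediction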

The success of PTMs can be simultaneously attributed to data, objectives, and neural architectures, and it seems difficult to separate the modules of the process and study them independently without interfering with each other. Overall, the exploration of the theoretical foundation of PTMs is a necessarily arduous journey, whereas any promising conclusions could have profound influences. We encourage the readers of our book to keep an open mind and attempt to apply theoretical tools beyond NLP, machine learning, and even computer science to analyze the behaviors of PTMs and develop corresponding frameworks.

14.2.2 P2: Next-Generation Model Architecture

It has already been 5 years since Transformer was first released. Its high capability and ease of parallelism have enabled Transformer-based models to scale up efficiently and achieve near-human or even beyond-human performance on numerous tasks. During these 5 years, we have witnessed the boom of Transformer-based PTMs and the realization of more and more previously unimaginable goals, one by one [13, 89]. We have also witnessed the spread of Transformer’s territory from NLP to other fields such as computer vision [32, 80], robotics [39, 106], etc. Undoubtedly, Transformer is one of the most revolutionary model architectures in the history of deep learning.

Despite the power of Transformer-based models, as we have introduced in the first key problem, there is still no sound theory able to elucidate the mechanism of Transformer. Besides, Transformer is a data-hungry and resource-intensive architecture, and the problem is further exacerbated as the model size increases [1]. Though Transformer is an epoch-making architecture, we still believe that it will not be the ultimate form of neural networks. A natural question we would like to ask is: what could be the next-generation architecture for neural networks?

From a historical perspective, we find that many of the earlier breakthroughs in neural networks were inspired by other disciplines. For example, the convolution in CNNs is borrowed from research on the receptive field in cats’ visual cortex [53], and the memory in LSTMs is also designed to mimic mechanisms of the human brain. Therefore, in this subsection, we stand at the intersection of different disciplines and focus on neural network architectures that are inspired by other fields. Specifically, we introduce some architectures inspired by dynamical systems, geometry, and neuroscience. While these architectures may not yet significantly outperform Transformer, they all have their own potential and strengths that are worth paying attention to.

Dynamical Systems Inspired Architectures

A dynamical system is a system whose state evolves over time, e.g., the random motion of particles, where the location of each particle changes over time. Looking at the propagation of hidden states between different layers of a deep neural network, it is intuitive to associate it with a discrete dynamical system by interpreting the layer depth as the time step. Indeed, many works have drawn the connection between deep neural networks and discrete dynamical systems described by ordinary differential equations (ODEs) [82, 132]. The hidden state propagation in ResNet [45] exactly resembles the forward Euler discretization of an ODE. Therefore, the computation in ResNet can be seen as implicitly solving an ODE defined by the model parameters. Apart from dynamical systems described by ODEs, dynamical systems described by controlled differential equations (CDEs) [17, 101] and stochastic differential equations (SDEs) [61, 76] have also been shown to be closely related to neural networks.
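
To make this correspondence concrete, the following minimal PyTorch sketch treats a stack of residual blocks as forward Euler steps of an ODE; the network defining the vector field and all sizes are illustrative assumptions rather than any particular published model.

import torch
import torch.nn as nn

class ODEFunc(nn.Module):
    """The vector field f(h) that defines the continuous dynamics dh/dt = f(h)."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, dim))

    def forward(self, h):
        return self.net(h)

def euler_integrate(func, h0, num_steps, step_size=1.0):
    """Forward Euler discretization: each step h <- h + s * f(h) is one 'residual block'."""
    h = h0
    for _ in range(num_steps):
        h = h + step_size * func(h)
    return h

h0 = torch.randn(8, 32)                        # a batch of hidden states
hT = euler_integrate(ODEFunc(32), h0, num_steps=6)
print(hT.shape)                                # torch.Size([8, 32])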

A number of advantages stem from the dynamical system perspective of neural networks. Examples are as follows:

  1. GPU memory efficiency. By introducing the adjoint state method [15] in the numerical optimization problem, the GPU memory consumption can be reduced from \(\mathcal {O}(L)\) for ResNet (L denotes the number of layers) to \(\mathcal {O}(1)\) [20, 76].

  2. Adaptive computational time. Ideally, models should spend less time on simple samples and more time on complex ones. However, current architectures treat the instances with different complexity equally. By leveraging the adaptive step-size solvers in numerical optimization literature, models can have adaptive time costs for different instances [18, 37].

Through the perspective of dynamical systems, neural networks can be naturally generalized to continuous systems, and the plentiful theories of dynamical systems can step in to inspire new designs for neural networks. We believe this is a promising area to explore.

Geometry Inspired Architectures

Humans live in a Euclidean world. Therefore, we naturally assume that the geometry of neural networks should also be Euclidean. However, this is not necessarily the case, as the data that neural networks handle differ from what we are exposed to. Many complex data, such as graph data, have been shown to exhibit non-Euclidean properties [12]. Intuitively, when neural networks are also non-Euclidean, they should be able to handle such data better because the geometries match.

Considering non-Euclidean geometries in neural networks brings several benefits: (1) Greater capability in modeling structured features, both theoretically and empirically. Many real-life graphs are known to be tree-like. However, even when the dimension of the Euclidean space is unbounded, tree structures still cannot be embedded with arbitrarily low distortion, i.e., some information will always be lost. In contrast, this can be easily achieved in a two-dimensional hyperbolic space, which is a non-Euclidean space [102]. In practice, many graph-related works have demonstrated the effectiveness of low-dimensional hyperbolic models [16, 22, 91]. (2) Combinability with the dynamical system perspective. Geometry can also collaborate with the dynamical system perspective mentioned above. From the perspective of geometry, the layers in neural networks can be seen as transformations on the coordinate representation of the data manifold. From the perspective of dynamical systems, the depth of neural networks can be continuous. When combined, it is possible to obtain a continuous transformation process from the data manifold to the final linearly separable manifold for different classes [12, 84]. This has the potential to provide a more intuitive understanding of how neural networks gradually transform the data from input features to features that can eventually be used for classification.
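
As a small illustration of the first benefit, the following NumPy sketch computes distances in the Poincaré ball, a standard model of hyperbolic space used by the hyperbolic embedding works cited above; the particular points and dimensions are made up for illustration.

import numpy as np

def poincare_distance(u, v, eps=1e-9):
    """Geodesic distance between two points inside the unit Poincare ball."""
    sq_u = np.sum(u * u)
    sq_v = np.sum(v * v)
    sq_diff = np.sum((u - v) ** 2)
    x = 1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v) + eps)
    return np.arccosh(x)

root = np.array([0.0, 0.0])           # a node near the origin acts like a tree root
leaf_a = np.array([0.95, 0.0])        # nodes near the boundary act like deep leaves
leaf_b = np.array([0.0, 0.95])
print(poincare_distance(root, leaf_a))    # ~3.7: root-to-leaf distance
print(poincare_distance(leaf_a, leaf_b))  # ~6.6: leaf-to-leaf distance, much larger

Distances grow rapidly toward the boundary of the ball, which is exactly the property that lets tree structures be embedded with low distortion in only two dimensions.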

In all, non-Euclidean geometry offers a prominent direction for neural networks. It is a promising approach for handling structured data and can be combined with other perspectives to offer better insight into neural networks.

Neuroscience Inspired Architectures

When thinking, unlike neural networks, we do not need to consume large amounts of energy, nor does our brain temperature spike to nearly 100 °C. Although still called neural networks, today’s artificial neural networks (ANNs) have become far more energy-hungry and resource-demanding than the human nervous system. The sparsity of the human brain allows it to consume much less energy than ANNs. Therefore, inspired by the sparsity of neuronal interconnections in the human brain, researchers have experimented with designing neural networks with sparsity along two dimensions: spatial sparsity and temporal sparsity.

The human brain has sparse neuronal connections and relatively distinct functional partitions. That is, neuronal connections in the human brain are spatially sparse. This allows us to accomplish a simple task without using neurons from the whole brain. Inspired by this spatial sparsity, the mixture-of-experts (MoE) structure has been proposed [33, 54]. Unlike conventional neural networks, which are densely connected, MoE divides each layer into several experts and additionally includes a router to route every input to only a few experts. Since not all experts are involved in the computation, inference can be faster than in densely connected networks. The advantage of MoE models in terms of computational cost allows them to scale up very efficiently. In addition, because different inputs are processed by different experts, ideally, different experts can learn to handle different aspects of a task (or even multiple tasks), making MoE suitable for artificial general intelligence. Indeed, MoE models have been shown to reach the state of the art on several benchmarks with less computational cost [88].
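
The following minimal PyTorch sketch shows the routing idea: a router scores the experts for each token, and only the top-k experts are executed, so per-token compute stays roughly constant as experts are added. The expert architecture, sizes, and top-k choice are illustrative assumptions, and real systems add load-balancing losses and batched expert dispatch that are omitted here.

import torch
import torch.nn as nn

class SparseMoELayer(nn.Module):
    def __init__(self, dim, num_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
             for _ in range(num_experts)]
        )
        self.router = nn.Linear(dim, num_experts)
        self.top_k = top_k

    def forward(self, x):                          # x: (num_tokens, dim)
        scores = self.router(x)                    # (num_tokens, num_experts)
        weights, indices = scores.topk(self.top_k, dim=-1)
        weights = weights.softmax(dim=-1)          # mixture weights over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):             # run only the selected experts per token
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
        return out

tokens = torch.randn(16, 64)
print(SparseMoELayer(64)(tokens).shape)            # torch.Size([16, 64])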

In addition to spatial sparsity, the human brain also exhibits temporal sparsity, i.e., neurons do not transmit signals at every time step. Spiking neural networks (SNNs) [38] mimic the behavior of information propagation between neurons interconnected by synapses. When the pre-synaptic neuron is activated, it sends a signal in the form of synaptic current to the post-synaptic neuron, and the current strength is proportional to the weight of the synapse. The incoming synaptic currents change the membrane potential of the post-synaptic neuron, and when the membrane potential reaches a certain threshold, the post-synaptic neuron emits a spike, and its membrane potential is reset to its resting potential. The biggest advantage of SNNs is their extremely low energy consumption: because SNNs only consume energy when emitting spikes, their energy consumption can be extremely low compared with mainstream neural networks [60, 87, 112]. Also, neuromorphic chips, the specialized hardware for SNNs, allow both computation and parameter storage on the same chip, further boosting efficiency [26]. Although the performance of SNNs is often slightly lower than that of mainstream neural networks on datasets such as MNIST [70] and CIFAR-10 [66], the low-energy characteristic of SNNs makes them promising for the future.
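
The membrane dynamics described above can be summarized by a leaky integrate-and-fire neuron, sketched below in plain NumPy; the decay constant, threshold, and input current are illustrative values rather than parameters of any specific SNN model.

import numpy as np

def lif_neuron(input_current, threshold=1.0, decay=0.9, v_rest=0.0):
    """Simulate one leaky integrate-and-fire neuron; return its binary spike train."""
    v = v_rest
    spikes = []
    for i_t in input_current:
        v = decay * v + i_t        # leaky integration of the incoming synaptic current
        if v >= threshold:         # fire once the membrane potential crosses the threshold...
            spikes.append(1)
            v = v_rest             # ...and reset to the resting potential
        else:
            spikes.append(0)
    return np.array(spikes)

rng = np.random.default_rng(0)
current = rng.uniform(0.0, 0.4, size=50)   # weak random input current
print(lif_neuron(current))                 # a sparse 0/1 spike train; energy is spent only on the 1s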

Looking back at history, AlexNet [67] was proposed in 2012, and since then deep neural networks such as CNNs and RNNs have taken the lead in machine learning. Five years later, in 2017, Transformer was introduced and gradually replaced models such as RNNs. Now, in 2022, another 5-year period has passed, and we wonder what the next-generation neural network could be. We believe Transformer will not be the ultimate form of neural networks, and we are eager to see more researchers think about and explore next-generation neural network architectures and propose more economical, more efficient, and more effective models.

14.2.3 P3: High-Performance Computing of Big Models

The numerous parameters of big models come with exceedingly expensive computation and storage costs, imposing substantial challenges on both training and inference. In fact, improving the computational efficiency of big models is a complicated process in which many fundamental aspects should be considered. In particular, improvements across the computational infrastructure, algorithms, and specific applications can be pursued simultaneously. In this subsection, we discuss high-performance computing of big models from these three perspectives.

High-Performance Computational Infrastructure

We collectively refer to the hardware and software as the computational infrastructure, which is the foundation for both the training and inference of big models and even deep neural networks in general. High-performance computational infrastructure can be further exploited in the following directions: (1) Parallel computing methods, including data parallelism [113], tensor parallelism [52, 90], pipeline parallelism [104], and hybrid parallelism [95], can fully utilize distributed computing capabilities to accelerate the computation of big models. (2) We should take advantage of heterogeneous computing devices [56], including multi-level computing devices consisting of GPUs and CPUs and multi-level storage devices consisting of VRAM, RAM, and disks, to reduce the computing cost while maintaining computing efficiency. (3) Considering that big models have large-scale parameters, we should investigate techniques to reduce the memory overhead, including tensor offloading [100, 107] and tensor rematerialization [21, 62], enabling us to compute bigger models using fewer computing devices. (4) Moreover, high-performance tensor programs [122] are also critical for deploying big models efficiently, especially sparse tensor programs [149] given the sparsity of neural networks.
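
As one concrete instance of tensor rematerialization, the following sketch wraps each block of a toy model with PyTorch’s built-in gradient checkpointing utility (assuming a reasonably recent PyTorch version): activations inside a checkpointed block are discarded during the forward pass and recomputed during the backward pass, trading extra compute for a smaller activation memory footprint. The model itself is an illustrative stand-in.

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

blocks = nn.ModuleList([nn.Sequential(nn.Linear(1024, 1024), nn.GELU()) for _ in range(12)])

def forward_with_rematerialization(x):
    for block in blocks:
        # Intermediate activations inside `block` are not stored; they are recomputed on backward.
        x = checkpoint(block, x, use_reentrant=False)
    return x

x = torch.randn(32, 1024, requires_grad=True)
loss = forward_with_rematerialization(x).sum()
loss.backward()            # each block's forward is re-run just before its backward
print(x.grad.shape)        # torch.Size([32, 1024])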

High-Performance Algorithms

Existing work on big models enjoys the emergent ability that comes with increasing parameters while largely ignoring the efficiency of parameter utilization. If we draw an analogy between big models and the brain, we will find that the brain incurs a much lower cost for a similar number of billions of parameters (neurons), thanks to some enigmatic mechanisms brought about by evolution. Recently, two Turing Award winners, Yoshua Bengio and Yann LeCun, have also highlighted the importance of neuroscience for AI [140], believing that the next generation of AI will be largely driven by neuroscience. Hence, it is promising to design new algorithms by utilizing knowledge from neuroscience. We discuss several important brain-inspired mechanisms as examples and hope these methods can inspire more explorations. (1) Learning from the memory mechanisms of human brains [115]. We should build an explicit memory system to store information and retrieve relevant pieces for a given input instead of computing all parameters [40, 72]. (2) Learning from System 1 and System 2 of human brains [25]. We should design a system that can automatically switch between the fast and the accurate modes for inputs with different levels of complexity [135]. (3) Inspired by recent work highlighting the importance of cooperation between brain regions [114], we should also explore how to compose multiple big models to achieve better performance [3], which is more efficient than training a bigger model from scratch.

High-Performance Application

When dealing with the limited resources of edge devices such as mobile phones, our approach should shift from squeezing performance out of computing devices to compressing the big models themselves for efficient deployment. As introduced in Chap. 5, there are many compression techniques, such as knowledge distillation [46] and parameter pruning [41], that can compress big models to acceptable scales. Overall, in terms of high-performance applications, we believe the following future directions show considerable potential. (1) Computing hardware sets boundaries for our compression techniques. Therefore, the properties of the target hardware must be considered to find the best compressed architecture with minimal latency [121] or energy cost [125], rather than minimal FLOPs. (2) Different downstream tasks may exhibit different characteristics, thereby requiring compression strategies with disparate focuses. We should explore task-aware compression to utilize the specific patterns of different tasks, such as vocabulary reconstruction [136] for tasks in a specific language and decoder-oriented compression for generation tasks [77]. (3) Many compression approaches achieve similar results but are orthogonal in technical aspects. Therefore, we could combine multiple compression techniques to achieve higher compression ratios. Some preliminary works have begun to investigate combinational compression and have already achieved promising results [148]. However, how to combine all existing methods to achieve optimal inference acceleration within an acceptable performance degradation remains an open problem.
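
To make the distillation idea concrete, the following PyTorch sketch shows the common temperature-softened objective in which a small student matches the teacher’s output distribution in addition to the hard labels; the temperature, loss weight, and random logits are illustrative, and the exact objective varies across the methods cited above.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    # Soft targets: KL divergence between the softened teacher and student distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Hard targets: standard cross-entropy against the gold labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

student = torch.randn(4, 10, requires_grad=True)   # student logits: 4 examples, 10 classes
teacher = torch.randn(4, 10)                       # teacher logits (frozen, no gradient)
labels = torch.tensor([1, 3, 0, 7])
print(distillation_loss(student, teacher, labels))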

The development of high-performance computing is an important driving force for deep learning, especially for big pre-trained models. In the past, performance gains have mainly come from the growth of computing power. In the future, we need to devote more effort to improving the utilization of computing power. On the one hand, this lowers both the bar for anyone interested in AI to use big models and the carbon footprint of computing them. On the other hand, in the post-Moore era, there is limited room for further growth in raw computing power, and new methods should shift from relying on that growth to improving efficiency.

14.2.4 P4: Effective and Efficient Adaptation

Before the arrival of the era of PTMs, empirical improvements in NLP applications were primarily achieved by considerations across models, algorithms, task-specific characteristics, etc. After PTMs took the stage, researchers found that prominent advancements in almost all NLP tasks could be delivered by merely scaling up PTMs. Such success of scaling, despite being elusive, has fueled a surge in the development of big models with billions [93] and even hundreds of billions of parameters [13]. Accordingly, the emergence of big models triggers thought-provoking explorations of advanced model adaptation, which suggest that the full-parameter fine-tuning approach used for early PTMs is not the optimal solution for model adaptation. It is neither effective across all forms of datasets nor economically efficient on common computation devices. That is to say, the inherent characteristics of big models themselves must be taken into account, and innovative strategies for model adaptation should be established. To this end, how to effectively and efficiently adapt big models becomes a pivotal research issue. In this subsection, we discuss three facets of this problem: computationally practical adaptation, task-wise effective adaptation, and advanced adaptation with complex reasoning.

Computationally Practical Adaptation

The huge size of big models is a blessing in terms of experimental performance, whereas it is a curse in terms of the adaptation process. Deploying and adapting these models to assorted tasks require considerable computational and storage resources that are prohibitive to common researchers. Instead of updating all the parameters of big models, recent studies of delta tuning [30, 49, 50, 75] find that tuning only a tiny portion of parameters can yield performance comparable to, or even better than, full-parameter fine-tuning. These trainable parameters can take different structures or positions in big models, but a consistent empirical finding is that the larger the model, the better this paradigm performs. Delta tuning reifies conceptual capabilities to solve particular tasks in a concrete and lightweight manner. The resulting lightweight delta objects are easy to store and share across tasks and users, granting considerable maneuverability to big models and unleashing the imagination of the industrialized use of these behemoths. Despite the efficiency, there are dark clouds still hanging over this topic. For example, it is difficult to assess the optimal amount of tunable parameters for different tasks, and the convergence of delta tuning is relatively slower than that of full-parameter fine-tuning. In addition, the theoretical principles behind the success of delta tuning could also help the community further understand big models. The revolution in model adaptation does not only occur at the parameter optimization level but also at the level of data and tasks. Next, we take prompt learning as a landing point to discuss the task-wise effective adaptation of big models.
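
To illustrate the flavor of delta tuning, the following PyTorch sketch freezes a pre-trained linear layer and adds a small trainable low-rank delta in the spirit of LoRA-style methods; the dimensions, rank, and zero-initialization choice are illustrative assumptions, and actual delta tuning methods place their trainable modules in various structures and positions.

import torch
import torch.nn as nn

class LowRankDeltaLinear(nn.Module):
    def __init__(self, frozen_linear: nn.Linear, rank=8):
        super().__init__()
        self.frozen = frozen_linear
        for p in self.frozen.parameters():          # the big model's weights stay fixed
            p.requires_grad_(False)
        in_dim, out_dim = frozen_linear.in_features, frozen_linear.out_features
        self.delta_down = nn.Linear(in_dim, rank, bias=False)   # trainable
        self.delta_up = nn.Linear(rank, out_dim, bias=False)    # trainable
        nn.init.zeros_(self.delta_up.weight)        # start as an exact copy of the frozen layer

    def forward(self, x):
        return self.frozen(x) + self.delta_up(self.delta_down(x))

layer = LowRankDeltaLinear(nn.Linear(1024, 1024), rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable fraction: {trainable / total:.2%}")   # ~1.5% of all parameters for these sizes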

Task-Wise Effective Adaptation

Taking BERT [29] as an example, PTMs in the early stage first produce representations for the current inputs and adopt extra classifiers to carry out adaptation to downstream tasks. This seemingly established approach may actually be counter-intuitive since there is a considerable chasm between pre-training and adaptation. Empirical evidence shows that inserting additional contexts, i.e., prompts, and transforming downstream tasks into pre-training tasks can substantially shrink this gap and yield promising performance, especially in low-data regimes. Prompts can be generated and constructed in different means and forms, but fundamentally, this technique implies a trend toward the unification of NLP tasks, which includes the unification of pre-training tasks and downstream tasks, as well as the unification of different downstream tasks. Prompt learning has shown intriguing attributes such as zero- and few-shot learning, task generalization, and the structural unification of datasets. Besides, the flexibility of prompts makes it possible to smooth the logic chain of big models and stimulate complex reasoning capabilities.
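
A tiny Python sketch of this idea is shown below: a sentiment classification input is wrapped in a cloze-style template so that the downstream task looks like masked language modeling; the template and the verbalizer mapping label words to classes are made-up illustrations.

TEMPLATE = "{text} It was [MASK]."
VERBALIZER = {"great": "positive", "terrible": "negative"}   # label word -> class

def wrap_with_prompt(text):
    """Turn a raw input into a masked-language-modeling query."""
    return TEMPLATE.format(text=text)

query = wrap_with_prompt("The plot was predictable and the acting was wooden.")
print(query)
# A masked language model scores candidate words at the [MASK] position; whichever
# label word in the verbalizer receives higher probability determines the predicted class.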

Advanced Adaptation with Complex Reasoning

The reasoning capability of big models has been a long-standing debate that no one can perfectly arbitrate, where its existence, representation, and stimulation methods have remained open research questions for years. Intuitively, for human beings, solving more complex questions is almost equivalent to having more comprehensive reasoning ability. When it comes to big models or, more generally, neural networks, continuing studies on shortcut learning and the record-breaking performance on complex tasks create a confrontational situation. With no intention of philosophizing the argument, we look at this only from the perspective of performing complex tasks, where big models can produce striking logical processes in numerical and commonsense reasoning tasks [130]. Consistent with the aforementioned points on computationally practical and task-wise effective adaptation, such reasoning capabilities emerge at a certain point of model scaling, which implies that models should have sufficient capacity and be pre-trained on sufficient data to elicit complex reasoning. However, such reasoning abilities on complex tasks are not stable in practice: they show large variance across different data and are extremely demanding in terms of how they are stimulated. This puts researchers in the awkward position that we are all vaguely aware of the enormous potential of big models but have few clues about how to hit that upper limit.

In summary, research considerations of big model adaptations could be encapsulated in three points according to the above statements: First, big models should be computationally practical so that they can fully replace previous approaches when their training and storage are no longer an unattainable goal for the community. Delta tuning is a highly prospective attempt at the algorithmic level, and perhaps the community also needs to make efforts on computational systems and hardware. Second, the predictive power of big models could be realized by new types of data and task organization, and prompt learning is the product brought by the development of big models, which also pushes us to adopt a more unified perspective when looking at the tasks. Finally, to further tap the potential of big models, complex reasoning must be explored, and this is a key step for artificial intelligence to enter the cognitive level instead of making simple predictions.

14.2.5 P5: Controllable Generation with Pre-trained Models

Generating data from complex distributions is a long-standing challenge for the machine learning community due to the inherent high dimensionality and intractability. Fortunately, the unprecedented capabilities accompanying PTMs have brought this goal within reach and thus sparked a new surge of research. In empirical inspections of large-scale PTMs, researchers have discovered their impressive ability to generate high-quality text [13], images [94], videos [108], and programming code [19]. However, PTMs are black boxes, which makes us passively accept the generated results rather than actively control the models to produce content that matches a specific requirement. How to precisely introduce conditional constraints to control the generated results poses a major challenge for PTMs. Specifically, the challenge of controllable generation comes from three facets: a unified framework for diverse controls, the compositionality of controls, and a well-recognized evaluation benchmark.

A Unified Framework for Diverse Controls

The primary objective of controllable generation is to meet the diverse practical desires of users concerning content, features, and styles. Diverse controls result in dispersed research efforts. For example, depending on the category of the input, separate models are trained for generation from paragraphs [36], dialogues [145], tables [109], etc. Regarding the properties of the generated text, requirements for sentiment orientation [51] or keyword satisfaction [147] are accomplished by distributional change or insertion-based methods, respectively. In spite of the proliferation of works on diverse controls, we would prefer a unified framework that accomplishes all these controls rather than designing a specific method for each requirement. A unified framework can not only encourage research to iterate rapidly and convergently but also enable the investigation of the relatedness and combinatoriality of diverse controls. Recently, there have been several research efforts in this direction: (1) Prompt-based methods. By injecting either a control code [58] or continuous parameters [75], we can leverage the same PTM with diverse controls. The major drawback is that prompt-based methods usually have coarser control granularity or weaker control power and are thus incapable of handling hard-constraint tasks such as copying a span of text. (2) Distribution modification methods. By incorporating different constraints in the decoding stage of the language model [78], the generated text from the same PTM can be steered in different directions. Their limitation is that distribution modification methods may hinder the fluency of generation [59]. Hence, how to combine the two approaches or develop novel approaches for unification remains an open question.
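
The distribution modification idea can be sketched in a few lines of PyTorch: at each decoding step, the frozen language model’s next-token distribution is reweighted by an attribute score before sampling. The vocabulary size, the hand-made bonus for a single “topic” token, and the steering strength below are toy stand-ins for a real attribute model.

import torch
import torch.nn.functional as F

def steer_next_token(lm_logits, attribute_bonus, strength=2.0):
    """Combine the LM's logits with an attribute score before sampling the next token."""
    steered = lm_logits + strength * attribute_bonus
    return F.softmax(steered, dim=-1)

vocab_size = 10
lm_logits = torch.randn(vocab_size)        # what the frozen PTM predicts at this step
attribute_bonus = torch.zeros(vocab_size)
attribute_bonus[3] = 1.0                   # pretend token 3 matches the desired topic
probs = steer_next_token(lm_logits, attribute_bonus)
next_token = torch.multinomial(probs, num_samples=1)
print(probs[3], next_token)                # token 3 is now much more likely to be sampled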

Compositionality of Controls

In addition to diversity, controllable generation is also expected to be multidimensional and multi-grained to allow more intricate combinations of controls. As discussed in Chap. 3, compositionality, which studies how low-level linguistic units form high-level semantics, is a topic of considerable interest in text representation [86] and natural language understanding. It is less explored in the context of controllable generation due to the dispersal of control approaches. To this end, advocating a unified framework for generation can also contribute to compositionality. To steer generation toward multiple control requirements simultaneously, combining prompts with individual functionalities can be explored to form more comprehensive capabilities [92]. Nonetheless, the exploration is still primitive, with the simple concatenation of prompts as the composition method. As yet, we do not have an understanding of the internal mechanism of controllable generation for PTMs, making it difficult to develop advanced compositional control methods. Of course, we also look forward to other novel approaches that can achieve the compositionality of controls.

Well-Recognized Evaluation Benchmark

As ImageNet [28] in computer vision and the GLUE benchmark [120] in natural language understanding have demonstrated, a recognized benchmark can foster benign competition among researchers and identify promising approaches. However, such a benchmark is absent for generation tasks, especially controllable generation. The problem is further compounded by the fact that researchers may use different assessment methods and different data even when focusing on the same aspects of controllability [64, 78]. We highlight three difficulties in establishing a benchmark for controllable generation and the potential improvements. (1) First, human language is rich in expression, and the same meaning can take on many nuances, so no single golden answer is sufficient. A possible solution is to create semantic matches between utterances. This requires a powerful semantic understanding model that can provide reliable matching scores from diverse angles. Previous works, e.g., BERT-Score [146], are still insufficient in this regard. Whether large PTMs like GPT-3 could be used to provide powerful semantic matching is still an open problem. (2) Second, control requirements are intractable and diverse. For example, topic satisfaction or emotional tendencies are difficult to measure quantitatively. Considering this diversity, how to integrate the criteria into a unified implementation that can be used across the community is a complicated but urgent task. (3) Third, evaluation should take into account potentially degraded factors such as quality and efficiency. Some works [78] point out that there is an inevitable trade-off between the control satisfaction rate and text quality. Additionally, either increasing the length of the input via prompts or applying complex decoding strategies will sacrifice generation efficiency, which should also be taken into consideration for a well-rounded evaluation. Due to the aforementioned challenges, few attempts have been made to unify the evaluation, and a universally recognized benchmark is still urgently needed.

Controllable generation is important in all areas of AI. The approaches to controllable generation are not unified across tasks, which in turn leads to difficulties in composing various control approaches. Further, the challenge of controllable generation is exacerbated by the lack of a well-recognized evaluation benchmark. Advances in the above three directions will greatly contribute to the controllability of generation and thus make generation techniques better serve practical needs.

14.2.6 P6: Safe and Ethical Big Models

With the exciting progress made in recent years, big models are deemed cornerstones of modern NLP as well as AI. However, responsible AI research calls for clear recognition of both benefits and risks. While the benefits of big models are under extensive exploration, we should also be concerned about their potential negative impacts and harms to individuals and society before deploying them in the real world. In Chap. 8, we have discussed the robustness requirements for NLP models, and most of those topics are related to model safety or ethics. Although considerable efforts have been devoted to these issues, major challenges remain to be solved and future directions remain to be explored. In this subsection, we discuss open problems toward safe and ethical big models from the perspectives of evaluation, governance, and construction.

Evaluating Safety and Ethical Levels

The very first challenge in building safe and ethical big models is how to conduct rigorous and comprehensive evaluations. For model safety, we have introduced several essential threats against NLP models in Chap. 8, including backdoor attacks, adversarial attacks, and distribution shifts. However, a golden standard of model safety has not been reached, which means we still lack comprehensive safety evaluations. As deployed models are continually exposed to complex external environments, risks keep emerging, and we wonder whether the models are robust to them. Tramer et al. [117] point out that the majority of adversarial defense methods fail to work when attackers adapt their attack strategies accordingly. This suggests that safety against known threats is not enough, and underlying unknown threats should also be taken into consideration.

Measuring the ethical level of models is even more complicated. It has been observed that big models can generate stereotypical or hateful comments about certain groups of people [131], disseminate false or misleading information [144], and leak private information from training data [14]. Obviously, these behaviors violate human values and thus are undesirable. However, while it is easy to find individual cases, it is rather difficult to conduct rigorous measurements since human values are hard to specify. Given social and regional diversity, there does not exist a static and universal rule to assess ethical levels. Worse still, values about politics, religion, and ethnicity often conflict across groups, making evaluation even harder. Under such conditions, datasets and benchmarks in this research field need to be carefully checked for valid measurement. We also suggest that researchers cooperate with sociologists to gain theoretical insights.

Governing Big Models

Given the potential safety and ethical risks of big models, how to correctly cooperate with big models is an essential problem for the AI community, which is referred to as model governance. However, big model governance is challenging both technically and non-technically. On the technical side, big models are capable of completing various downstream tasks via simple adaptation, including harmful ones such as generating offensive speech or fake news. Due to the black-box nature of big models, finding and disabling these harmful functionalities can be difficult. Although practitioners adopt some effective approaches like keyword filters, these cannot guarantee that the models are fully governed [119], leaving this problem open for future research. On the non-technical side, model governance is not only about the research community but also about establishing principles and laws across model providers and users, which requires multi-party cooperation. We are glad to see that some responsible organizations are contributing in this area [24], and we appeal to more researchers to help advance this important direction.

Building Inherently Safe Models

Another fundamental question about model safety is how we can learn inherently safe models. In Chap. 8, we introduce approaches to robustness issues, but most of the methods mentioned there are targeted at specific problems, except pre-training. However, while it has been widely acknowledged that bigger models may make fewer mistakes, we still argue that scaling model and data sizes is not the elixir that eliminates safety problems, because an inherently safe model does not equal a model making no mistakes. Instead, to achieve human-level robustness, models should (1) know what they know and do not know (i.e., be calibrated) and (2) learn from mistakes and correct themselves [69, 83]. In this regard, current big models are still far from inherently safe, and we hope to see more efforts devoted to this fundamental problem. Toward inherently safe models, we identify two possible directions. (1) Incorporating knowledge. In Chap. 9, we see the remarkable success achieved by injecting knowledge into PTMs. For model safety, incorporating knowledge can help as well. For example, models will not be fooled by “U r stupid!” if they possess phonetic knowledge. Hence, we recognize building knowledgeable big models as a reliable approach to model safety. (2) Cognitive learning. The current learning paradigm for big models is still data-driven, which cannot fully reflect the underlying risks in the real world. Different from models, we human beings can actively interact with the world and consistently gain knowledge. Moreover, we also largely benefit from the “trial and error” process and learn how to avoid mistakes. Therefore, we stress the importance of learning from cognition and interaction for building safe models [65], and we further elaborate on this topic in Sect. 14.2.8.

Safety and ethics are two long-standing topics in AI and have even been extensively discussed in literature and artworks (e.g., Isaac Asimov’s “Three Laws of Robotics” [5]). Out of concern about runaway powerful machines, we present several key challenges and future directions for this open problem. We stress that, in the context of today’s AI hype, we researchers especially need to consider every single step carefully and take responsibility for the healthy development of big models.

14.2.7 P7: Cross-Modal Computation

Building intelligent agents that can think and behave like humans is a long-standing goal of AI. An important and appealing characteristic of human intelligence is the impressive capability of perceiving and handling information from different modalities. Recently PTMs have greatly pushed forward the development of intelligent agents in single modalities (such as text [29], image [44], and audio [31]) and also led to breakthroughs in cross-modal computation. By exploiting self-supervised signals in large-scale cross-modal data, generic representations connecting different modalities can be effectively pre-trained and transferred to facilitate various downstream tasks. Cross-modal PTMs based on the pre-training-fine-tuning paradigm seem to constitute a promising foundation to realize such cross-modal intelligence. To this end, we discuss several promising directions for advancing cross-modal PTMs in this subsection, including big cross-modal models with efficient pre-training and adaptation, more unified representation with more modalities, and embodied cross-modal reasoning and cognition.

Big Cross-Modal Models with Efficient Pre-training and Adaptation

Existing works show that impressive capabilities can emerge in pre-trained language models when the model capacity (e.g., the number of parameters) substantially scales up. For example, the 175B-parameter GPT-3 is able to perform in-context few-shot learning and chain-of-thought prompting for complex tasks. However, although cross-modal pre-training on deep Transformers has pushed forward the state of the art on various tasks, cross-modal models are typically limited in parameter size compared with language models. This hinders the exploration of more advanced capabilities and tasks for cross-modal PTMs. An important reason is that, compared with big language models, it can be even more expensive to pre-train and adapt big models that deal with multiple modalities. Some works have explored more efficient pre-training by reusing well-pre-trained unimodal models and focusing on connecting PTMs from different modalities [3]. Some works have investigated the efficient adaptation of vision-language models in terms of both data [3, 126, 139] and parameters [150]. In the future, more efforts can be devoted to the efficient pre-training and adaptation of big cross-modal representation learning models.

More Unified Representation with More Modalities

Traditional cross-modal works typically design highly specialized model architectures to maximally exploit the inductive bias of modalities and tasks. For example, RNNs are designed to model the sequential dependency of text, and CNNs are developed to model the shift and scale invariance of images. The learning signals usually come from the human annotation of specific tasks. However, designing specific model architectures and learning signals for different modalities and tasks requires extensive expert knowledge, and it can be problematic to maintain a separate model for each of a large number of tasks. With the development of deep cross-modal pre-training, cross-modal representation learning models are becoming more unified in terms of model architectures and learning mechanisms [74, 138]. Most recently, some works have shown promising results in using unified model architectures, parameters, and learning mechanisms for unimodal, cross-modal, and embodied tasks [97, 123, 124]. Some works have explored pre-training with more modalities, including text, image, and audio [79]. In the future, building a unified representation learning model that can simultaneously deal with various modalities and tasks will be a promising foundation and path toward realizing general intelligent systems.

Embodied Cross-Modal Reasoning and Cognition

Semantic recognition capability has been extensively investigated for different modalities, e.g., named entity detection from text and object detection from images. For more complex reasoning and cognition capabilities, obstacles have been encountered in different ways: (1) For modalities with low information density, such as images and audio, semantic recognition can already be a challenging task [98], let alone more complex reasoning [143]. (2) For text, which has high information density, it is more natural to perform complex reasoning based on abstract symbolic tokens, and recently big language models have shown promising results in commonsense and mathematical reasoning [130]. However, many AI researchers believe that true cognition capability cannot arise from learning only from text [8]. Research in cognitive science also shows that the human mind is highly shaped by embodied learning [133]. Therefore, a more promising direction will be an embodied cross-modal reasoning model, where concrete signals from other modalities are effectively aggregated into a text-based central unit for high-level semantic reasoning. Some attempts have been made [10], and we believe the direction is worth more exploration.

In summary, as an important interdisciplinary area that connects information in different modalities, cross-modal computation is essential and beneficial to various real-world AI applications and is also one of the key problems on the path to more general intelligent systems. With their recent rapid development, cross-modal PTMs have become a new foundation for advancing toward this goal. We believe that developing an efficient big cross-modal PTM that can deal with various complex embodied reasoning tasks in a unified fashion will be a promising direction.

14.2.8 P8: Cognitive Learning

An essential measure of general AI is whether neural models can correctly perceive, understand, and interact with the world, i.e., the cognitive ability. A prototype of general intelligence can be viewed as the capability of manipulating existing tools (e.g., search engines, databases, web-based email systems, etc.), conducting cognitive planning with complex reasoning, and interacting with the real world to acquire and organize information and knowledge.

Serving as the foundation for AI, PTMs have pushed state-of-the-art performance on a variety of downstream tasks. The rich language knowledge, world knowledge, and commonsense knowledge stored in PTMs determine their unique advantages in cognitive modeling. Efficiently utilizing such knowledge contributes to stimulating the cognitive ability of PTMs, based on which PTMs could effectively interact with the real world in complex scenarios. Despite the great success, current PTMs still cannot handle advanced cognitive tasks. To bring PTMs to human-level cognitive intelligence, we identify three core challenges for achieving general cognitive intelligence:

Understanding Human Instructions and Interacting with Tools

How could PTMs better understand users’ instructions and interact with existing tools to complete a specific task? Fulfilling this goal requires precisely (1) mapping natural language instructions in the semantic space to the cognitive space of the model and (2) mapping the cognitive ability of the model to the action space of the tool, so as to correctly perform the operation and use the tool. The realization of this goal has profound practical significance: (1) for one thing, an ideal next generation of human-computer interaction (HCI) will be based on natural language rather than a graphical user interface (GUI). The user only needs to inform the model of the goals that need to be achieved, and the model can perform a series of operations in response; (2) for another, the bar for utilizing complex tools will be greatly lowered. In this sense, any beginner can quickly get started with new software or tools with the help of the model, making it more convenient to fulfill an intended complex task. However, PTMs trained on general domains are not designed for instruction understanding or tool manipulation by nature. To this end, a potential solution is continual pre-training, which adapts the PTM from the original pre-training domain to the human instruction domain, so as to better grasp the semantics of human instructions. In addition, it is also promising to design knowledge-enhanced tuning methods to improve PTMs’ semantic understanding of specific domains under the guidance of structured human knowledge.

Cognitive Planning and Reasoning for Complex Tasks

Based on a proper understanding of human instructions, PTMs could form implicit solution chains, i.e., thoughts, for complex tasks. This process requires the ability to reason and plan for complex tasks. Such an ability has a variety of applications, including theorem proving [68], tool manipulation [137], etc. The recently emerged chain-of-thought (COT) prompting techniques [130] can be leveraged to further enhance PTMs’ reasoning ability. Through a sequence of intermediate natural language reasoning steps, COT prompting helps PTMs decompose a complex task into relatively simple atomic tasks and solve them one by one. Ultimately, the correct decision-making path can be found to achieve the user’s goal. Another potential solution for complex reasoning is to “learn from experiences,” that is, to generalize the reasoning process of a specific task to form “thoughts” for planning on other tasks. To achieve this goal, we need to train models to understand how different tasks are intrinsically related, so as to break the barriers between different tasks. In this way, models can learn various tools by analogy. Such a capability is related to a concept in cognitive psychology: human beings generalize a property from one stimulus to another if the two are similar in an appropriate psychological space [103].
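
The spirit of COT prompting can be seen in the following toy Python snippet, which builds a prompt whose demonstration spells out its intermediate reasoning steps; the arithmetic word problems are made up for illustration, and the string would be handed to a sufficiently large PTM for completion.

cot_prompt = """Q: A library has 120 books and buys 3 boxes of 15 books each. How many books does it have now?
A: The 3 boxes contain 3 * 15 = 45 books. 120 + 45 = 165. The answer is 165.

Q: A train travels 60 km per hour for 2 hours and then 40 km per hour for 3 hours. How far does it travel?
A:"""

print(cot_prompt)
# Ideally, the model's continuation spells out the intermediate steps
# (60 * 2 = 120, 40 * 3 = 120, 120 + 120 = 240) before giving the final answer 240.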

Integrating Information from the Real World

By interacting with the real world, we may gather a series of fragmented pieces of information. It is of great importance for PTMs to integrate the information returned by existing tools into a self-contained and well-organized whole. Rendering such organized information to humans completes a closed loop for a cognitive task. Integrating information is challenging for PTMs because newly retrieved information may contradict the original knowledge or beliefs of the PTMs themselves, and it is under-explored how to combine the implicit knowledge of PTMs with knowledge retrieved from the real world. In fact, recent efforts have been made to address this challenge. For instance, in open-domain QA, WebGPT [89] and GopherCite [85] are proposed to leverage externally retrieved knowledge to increase the reliability, faithfulness, factuality, and interpretability of the outputs produced by PTMs. Specifically, researchers teach PTMs to interact with reliable IR systems like Microsoft Bing and Google Search, so that the system can retrieve more faithful and relevant documents. After that, PTMs are trained to organize the supporting facts into a coherent and self-contained answer. Although many efforts have been devoted to integrating textual information from the real world, less is studied about other types of information (e.g., graphical information, tabular information, etc.).

To sum up, the ultimate goal of cognitive learning is to move toward the next generation of machine intelligence. Cognitive intelligence will enable PTMs to play a more involved role in all walks of life and to interact with the real world on behalf of humans, which will have a huge impact on both academia and industry.

14.2.9 P9: Innovative Applications of Big Models

AI is a discipline that emphasizes practical applications and is widely expected to play a role in a broad range of downstream fields and task scenarios. Many of these applications, such as autonomous driving [110] and medical assistance [35], present both immense value and great challenges. Traditional solutions for AI applications follow two main ideas. The first is to build symbolic systems driven by human knowledge (like the expert systems of the 1980s), but it is difficult for manually crafted rules to cover all the scenarios encountered in practice. The other is to build data-driven deep learning systems, which still face obstacles in fields that lack sufficient high-quality training data due to high labeling costs.

The emergence of big models has brought new possibilities for innovative applications. Big models are equipped with a substantial amount of human knowledge, which is scattered in large-scale unlabeled corpora and can be acquired in an unsupervised manner, avoiding high annotation costs. Representative big model applications can be classified into two types: new breakthroughs and new scenarios.

New Breakthroughs

This type refers to big model systems that achieve surprisingly good performance on long-standing application problems. For example, the Critical Assessment of protein Structure Prediction (CASP) challenge has been held for over 20 years, and machine learning systems made only slow progress on this task until the appearance of AlphaFold [57], as we have introduced in Chap. 12. Image generation is also a classical task, yet DALLE-2 [94] achieves, for the first time, high-resolution generation that precisely expresses the meaning of the given text, producing results so realistic that humans can hardly tell whether they are real. Further, DALLE-2 can imitate paintings of a particular style or even create things that have never been seen in the real world. This greatly inspires and expands the boundaries of artistic creation and has spurred a new wave of AI-generated content (AIGC).

New Scenarios

This type refers to problems that are newly proposed or solved for the first time by AI methods. For instance, the characterization of COVID-19 has been a new and significant research topic in recent years, and big models have been applied to precision diagnostics, drug repurposing, epidemiological spread forecasting, and other problems [105]. By contrast, ancient writing research is an old topic in which AI never played a central role until DeepMind proposed Ithaca [6], a model designed for ancient Greek inscriptions. Ithaca can perform textual restoration, geographical attribution, and chronological attribution; it helps historians improve their accuracy from 25% to 72% and provides evidence for the study of history and civilization.

In the above examples, the growth of parameter scale brings greater knowledge capacity and generalization across various domains, which leads to a leap in performance. Based on these success cases, we propose the following two prerequisites for an application scenario to benefit from big models: plenty of domain data and documented domain knowledge.

Plenty of Domain Data

Big models need more data for training (e.g., 650M training images for DALLE-2). Fortunately, the requirements on data form are much looser: unlabeled and heterogeneous data can be well utilized by big models. Most of these models follow the basic pre-training-fine-tuning paradigm and can use large-scale unlabeled data to learn a general understanding of basic elements (e.g., words for a language, pixels for an image) by themselves. From there, it is relatively easy to transfer to a specific downstream domain and solve its tasks with as little supervision as possible. For instance, recent works have explored the necessity and advantage of adopting models pre-trained on natural images for medical image processing, especially when the downstream dataset is small [111]. Besides, researchers also explore large-scale PTMs for domains with versatile data formats, such as the collaborative processing of chemical and natural language as we have introduced in Chap. 12. Nevertheless, even for a newly created scenario, corresponding domain data is still required to unleash the potential of big models.
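
The following PyTorch sketch illustrates, under simplified assumptions, the pre-training-fine-tuning transfer described above: a stand-in encoder (which would be loaded from a pre-trained checkpoint in practice) is kept fixed, and only a small task head is trained on a tiny labeled domain dataset.

```python
# Minimal sketch of the pre-training-fine-tuning paradigm: a (stand-in)
# pre-trained encoder is reused and only a small task head is trained on a
# small labeled domain dataset. In practice the encoder would be loaded from a
# large-scale pre-trained checkpoint rather than built here.
import torch
from torch import nn

encoder = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 256))
# Imagine: encoder.load_state_dict(torch.load("pretrained_encoder.pt"))
for p in encoder.parameters():
    p.requires_grad = False           # keep general pre-trained knowledge fixed

head = nn.Linear(256, 2)              # small task-specific classifier
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# A tiny labeled domain dataset (random stand-in for, e.g., medical features).
x, y = torch.randn(32, 128), torch.randint(0, 2, (32,))

for _ in range(10):                   # a few steps of supervised fine-tuning
    logits = head(encoder(x))
    loss = loss_fn(logits, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
print(f"fine-tuning loss: {loss.item():.4f}")
```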

Documented Domain Knowledge

For fields in which humans already have a basic understanding, the architecture and training strategy of big models can be carefully designed based on the corresponding prior knowledge, and documented knowledge also provides the basic conditions for big models to access and utilize knowledge. In previous chapters (e.g., Chaps. 9 and 11), we have explained how to conduct knowledge-guided representation learning, such as architecture reformulation and input augmentation methods. In addition, big models have been shown to have behavioral imitation capabilities that allow them to access knowledge as human beings do. A typical example is WebGPT [89], which can automatically search for commonsense and factual knowledge to generate more reasonable answers, as we have introduced in cognitive learning. From these examples, we can see that scenarios with existing domain knowledge bases or ontologies offer more favorable conditions for realizing innovative applications.
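
As one simple way to exploit documented domain knowledge, the sketch below illustrates input augmentation: knowledge-base triples about the entities mentioned in the input are appended as extra context for the model. The toy knowledge base and the string-matching lookup are hypothetical simplifications for illustration.

```python
# Sketch of knowledge-guided input augmentation (hypothetical knowledge base):
# documented domain knowledge, stored as triples, is retrieved for the entities
# in the input and appended as extra context for the model.

KNOWLEDGE_BASE = {
    "aspirin": [("aspirin", "treats", "headache"),
                ("aspirin", "interacts_with", "warfarin")],
}

def augment(text: str) -> str:
    facts = []
    for entity, triples in KNOWLEDGE_BASE.items():
        if entity in text.lower():
            facts += [f"{h} {r} {t}" for h, r, t in triples]
    if not facts:
        return text
    # The model sees the original input plus the documented knowledge.
    return text + "\nKnown facts: " + "; ".join(facts)

print(augment("Can a patient on warfarin take aspirin?"))
```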

Spreading the wings of imagination, we can see many fields in which big models can engage, from sophisticated scientific prediction (such as on weather data) to smart home services in our daily life. More innovative applications are waiting for us to explore.

14.2.10 P10: Big Model Systems Accessible to Users

Due to the generalizability of pre-trained models in terms of architecture and capability, big models are expected to become a foundational infrastructure for many information services supported by NLP and AI [9], e.g., search engines, personalized recommendation, and virtual assistants, as well as for domain-specific information organization in, e.g., the financial, medical, legal, and academic domains.

In particular, recent findings on parameter-efficient delta tuning [30] show that, by keeping a central big model fixed, we can simply design task-specific delta objects to adapt the central model to handle multiple downstream tasks. These breakthroughs indicate a new technical paradigm in NLP: from training a task-specific model for each task separately to stimulating task-specific knowledge scattered in a unified and versatile big model. Intuitively, with pre-trained big models, our focus is no longer limited to how to learn model parameters for specific tasks but extends to how to stimulate the knowledge of big models to handle specific tasks.
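
The following PyTorch sketch illustrates the spirit of delta tuning with a LoRA-style low-rank delta object: the central model’s weights stay frozen, and only a small number of additional parameters are trained per task. The DeltaLinear module and its hyperparameters are illustrative assumptions, not a specific published implementation.

```python
# Minimal sketch of parameter-efficient delta tuning: the central model's
# weight stays frozen, and only a small low-rank delta (A, B) is trained per
# task, in the spirit of LoRA-style delta objects.
import torch
from torch import nn

class DeltaLinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                  # central model is fixed
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, rank))

    def forward(self, x):
        # Frozen base transformation plus the trainable low-rank update.
        return self.base(x) + x @ self.A.t() @ self.B.t()

layer = DeltaLinear(nn.Linear(768, 768))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable delta parameters: {trainable}/{total} ({100*trainable/total:.1f}%)")
```

In this toy configuration, the trainable delta amounts to roughly 1% of the layer’s parameters, while the frozen base weights can be shared across all tasks.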

Although the trend of building unified big models for multiple tasks is clear, it is still not easy for most institutions and individuals to enjoy the power of big models due to the computation and expertise barriers, as we have discussed in Chap. 13. We argue that, just as database management systems (DBMS) were proposed to manage massive data and big data analytics systems (BDS) were proposed for big data mining, it is time for us to build unified management systems for big models, i.e., big model systems (BMS). Similar to DBMS and BDS, which store and analyze data in a unified view, we should design BMS to build and organize big models in a unified view. BMS is expected to provide easy and standardized interfaces for the deployment and application of big models. We should consider the following principles to design BMS accessible to general institutions and individuals.

Data Form and Operation Abstraction of Big Models

Both data form abstraction and operation abstraction enable DBMS and BDS to serve as standard infrastructure in most companies and organizations. Examples of data form abstraction are the tables in relational DBMS (RDBMS), supported by the relational model of data [23], and the resilient distributed datasets (RDD) in the Spark BDS [141]. Examples of operation abstraction are the structured query language (SQL) in RDBMS and the map and reduce functions in the MapReduce BDS [27]. Intuitively, these abstractions decouple the users and developers of DBMS and BDS. Take DBMS as an example: users only consider how to manage data through a series of unified interfaces, without learning how the underlying modules of DBMS perform data management; developers, by ensuring that the interfaces provided to users remain unchanged, have more freedom to develop and optimize the underlying modules of DBMS.

We believe big models will also serve as an infrastructure for information services, beyond DBMS and BDS. A general-purpose BMS is expected to enable more people with basic programming skills to use big models. Hence, we should have data form abstraction and operation abstraction specifically designed for big models. BMS relies on data form abstraction to support learning big models from various types of data and to provide a unified scheme for model manipulation. With the help of prompt learning as a natural language interface between humans and big models [55, 99], we can design high-level and unified programming languages for BMS to manipulate big models and free big model users from directly interacting with big models through sophisticated deep learning programming.
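
A possible shape of such an operation abstraction is sketched below: users describe what they want through a task name and named inputs, and the BMS compiles the request into a prompt for the underlying big model, hiding deep learning details. All class and method names here are hypothetical.

```python
# Sketch of an operation abstraction for a big model system (all names are
# hypothetical): users state the task declaratively, and the BMS translates
# the request into a prompt for the underlying big model.

class BigModelSystem:
    def __init__(self, model_name: str):
        self.model_name = model_name   # a real BMS would load/route the model here

    def _generate(self, prompt: str) -> str:
        # Placeholder for the underlying big model's generation call.
        return f"<output of {self.model_name} for prompt: {prompt[:40]}...>"

    def run(self, task: str, **inputs) -> str:
        # The "query language": a task name plus named inputs, compiled into a prompt.
        prompt = f"Task: {task}\n" + "\n".join(f"{k}: {v}" for k, v in inputs.items())
        return self._generate(prompt)

bms = BigModelSystem("central-10B")
print(bms.run("summarize", document="Big models are becoming infrastructure ..."))
print(bms.run("translate", source="Bonjour le monde", target_language="English"))
```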

Efficient Computation and Management of Big Models

BMS should support the comprehensive management of big models based on many techniques from the above-mentioned topics, such as the high-performance computing mentioned in P3, the parameter-efficient delta tuning mentioned in P4, and the safety issues mentioned in P6. Since big model techniques are still developing rapidly, BMS will actively evolve in its physical implementation by taking advantage of these advances while keeping the user interface stable.

We further argue that, with the novel adaptation technique of delta tuning, BMS should manage and schedule central big models as well as massive task-specific delta objects to support a high concurrency of user requests. Hence, we need to design an efficient model scheduling manager (MSM) responsible for storing and distributing big models and delta objects on computing devices. There are many real-world scenarios that should be addressed by MSM, such as the continual learning and adaptation of big models, the efficient scheduling of multiple big models of various sizes and purposes, fault tolerance that recovers from hardware or network failures, and support for heterogeneous device architectures such as cloud-edge-terminal cooperation.
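
A toy sketch of such an MSM is given below, assuming hypothetical interfaces: a single central model stays resident in memory, while lightweight task-specific delta objects are loaded on demand, cached, and attached per request.

```python
# Toy sketch of a model scheduling manager (MSM): one central big model stays
# resident, while lightweight task-specific delta objects are loaded on demand
# and attached per request. All names here are illustrative.

class CentralModel:
    def infer(self, text: str, delta: dict) -> str:
        # Placeholder for inference with the frozen central model plus a delta.
        return f"<{delta['task']} output for: {text}>"

class ModelSchedulingManager:
    def __init__(self, model: CentralModel, cache_size: int = 2):
        self.model = model
        self.cache_size = cache_size
        self.delta_cache = {}                        # task name -> delta object

    def _load_delta(self, task: str) -> dict:
        if task not in self.delta_cache:
            if len(self.delta_cache) >= self.cache_size:
                self.delta_cache.pop(next(iter(self.delta_cache)))  # evict oldest
            # A real MSM would load delta weights from storage here.
            self.delta_cache[task] = {"task": task, "weights": f"delta-{task}.pt"}
        return self.delta_cache[task]

    def serve(self, task: str, text: str) -> str:
        return self.model.infer(text, self._load_delta(task))

msm = ModelSchedulingManager(CentralModel())
print(msm.serve("sentiment", "The movie was wonderful."))
print(msm.serve("summarization", "Big models are becoming infrastructure ..."))
```

The simple eviction rule here stands in for the more sophisticated scheduling, fault tolerance, and device placement policies a production MSM would require.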

In summary, we have shown the broad prospects of big models in the above-mentioned nine key problems, and we need big model systems to turn these prospects into a reality accessible to general institutions and individuals. OpenBMB, introduced in Chap. 13, can be regarded as our preliminary attempt at building big model systems. As discussed in this key problem, BMS brings many open problems concerning the deployment of big models in the real world, which require the collaboration of researchers and practitioners from deep learning and AI, high-performance computing, software engineering, networking, and edge/cloud computing. We believe an efficient and effective big model system will play an essential role in making the growing capabilities of AI accessible to everyone.

14.3 Summary

In this chapter, the final chapter of the book, we look ahead to the future of representation learning in 2023, standing on the shoulders of the new giants, big models. We list ten key problems of big models, including theoretical foundation, next-generation architecture, high-performance computing, parameter-efficient delta tuning, controllable generation, safety and ethics, cross-modality, cognitive learning, innovative applications, and big model systems.

Although the summarized problems may be biased by our own research experiences, we still hope they can help readers of the book find their interests. Any suggestions and comments from our community are welcome. Let’s work together on these exciting topics to contribute novel techniques and applications of AI in the future.