Paradigm Shift in Natural Language Processing

In the era of deep learning, modeling for most NLP tasks has converged to several mainstream paradigms. For example, we usually adopt the sequence labeling paradigm to solve a bundle of tasks such as POS-tagging, NER, and chunking, and the classification paradigm to solve tasks like sentiment analysis. With the rapid progress of pre-trained language models, recent years have observed a rising trend of paradigm shift, i.e., solving one NLP task by reformulating it as another one. Paradigm shift has achieved great success on many tasks, becoming a promising way to improve model performance. Moreover, some of these paradigms have shown great potential to unify a large number of NLP tasks, making it possible to build a single model to handle diverse tasks. In this paper, we review this phenomenon of paradigm shift in recent years, highlighting several paradigms that have the potential to solve different NLP tasks.


Introduction
A paradigm is the general framework to model a class of tasks. For instance, sequence labeling is a mainstream paradigm for named entity recognition (NER). Different paradigms usually require different input and output formats, and therefore highly depend on the annotation of the tasks. In the past years, modeling for most NLP tasks has converged to several mainstream paradigms, as summarized in this paper: Class, Matching, SeqLab, MRC, Seq2Seq, Seq2ASeq, and (M)LM.
Though the paradigm for many tasks has converged and dominated for a long time, recent work has shown that models under some paradigms also generalize well on tasks of other paradigms. For example, the MRC paradigm and the Seq2Seq paradigm can also achieve state-of-the-art performance on NER tasks (Yan et al., 2021b), which were previously formalized in the sequence labeling (SeqLab) paradigm. Such methods typically first convert the form of the dataset to the form required by the new paradigm, and then use a model under the new paradigm to solve the task. In recent years, similar methods that reformulate one NLP task as another have achieved great success and gained increasing attention in the community. After the emergence of pre-trained language models (PTMs) (Devlin et al., 2019; Raffel et al., 2020; Brown et al., 2020), paradigm shift has been observed in an increasing number of tasks. Combined with the power of these PTMs, some paradigms have shown great potential to unify diverse NLP tasks. One of these potential unified paradigms, (M)LM (also referred to as prompt-based tuning), has made rapid progress recently, making it possible to employ a single PTM as a universal solver for various understanding and generation tasks (Schick and Schütze, 2021a,b; Shin et al., 2020; Li and Liang, 2021; Liu et al., 2021b; Lester et al., 2021).
Despite their success, these paradigm shifts, scattered across various NLP tasks, have not been systematically reviewed and analyzed. In this paper, we attempt to summarize recent advances and trends in this line of research, namely paradigm shift or paradigm transfer.
This paper is organized as follows. In Section 2, we give formal definitions of the seven paradigms and introduce their representative tasks and instance models. In Section 3, we show recent paradigm shifts that have happened in different NLP tasks. In Section 4, we discuss the designs and challenges of several highlighted paradigms that have great potential to unify most existing NLP tasks. In Section 5, we conclude with a brief discussion of recent trends and future directions.


Paradigms in NLP

Paradigms, Tasks, and Models
Typically, a task corresponds to a dataset D = {X, Y}, where X and Y denote the set of inputs and the set of outputs. A paradigm is the general modeling framework to fit datasets (or tasks) with a specific format (i.e., the data structure of X and Y). Therefore, a task can be solved by multiple paradigms by transforming it into different formats, and a paradigm can be used to solve multiple tasks that can be formulated in the same format. A paradigm can be instantiated by a class of models with similar architectures.

The Seven Paradigms in NLP
In this paper, we mainly consider the following seven paradigms that are widely used in NLP tasks, i.e. Class, Matching, SeqLab, MRC, Seq2Seq, Seq2ASeq, and (M)LM. These paradigms have demonstrated strong dominance in many mainstream NLP tasks. In the following sections, we briefly introduce the seven paradigms and their corresponding tasks and models.

Classification (Class)
Text classification, which is designating predefined labels for text, is an essential and fundamental task in various NLP applications such as sentiment analysis, topic classification, and spam detection. In the era of deep learning, text classification is usually done by feeding the input text into a deep neural encoder to extract a task-specific feature, which is then fed into a shallow classifier to predict the label, i.e. Y = CLS(ENC(X)). Note that Y can be one-hot or multi-hot (in which case we call it multi-label classification). ENC(·) can be instantiated as convolutional networks (Kim, 2014), recurrent networks (Liu et al., 2016), or Transformers (Vaswani et al., 2017). CLS(·) is usually implemented as a simple multi-layer perceptron following a pooling layer. Note that the pooling can be performed over the whole input text or over a span of tokens.
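To make the distinction between one-hot and multi-hot targets concrete, here is a minimal sketch (function and label names are illustrative) of how Y is encoded in the single-label and multi-label settings:

```python
def encode_labels(labels, label_set):
    """Encode the set of gold labels as a multi-hot vector over label_set.

    With a single gold label this reduces to a one-hot vector (the standard
    Class paradigm); multiple gold labels give the multi-label case.
    """
    return [1 if label in labels else 0 for label in label_set]

label_set = ["sports", "politics", "tech"]
print(encode_labels({"tech"}, label_set))            # one-hot: [0, 0, 1]
print(encode_labels({"sports", "tech"}, label_set))  # multi-hot: [1, 0, 1]
```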

Matching
Text matching is a paradigm to predict the semantic relevance of two texts. It is widely adopted in many fields such as information retrieval, natural language inference, question answering, and dialogue systems. A matching model should not only extract the features of the two texts, but also capture their fine-grained interactions. The Matching paradigm can be simply formulated as Y = CLS(ENC(X_a, X_b)), where X_a and X_b are the two texts to be matched, and Y can be discrete (e.g. whether one text entails or contradicts the other text) or continuous (e.g. semantic similarity between the two texts). The two texts can be separately encoded and then interact with each other (Chen et al., 2017b), or be concatenated and fed into a single deep encoder (Devlin et al., 2019).

Sequence Labeling (SeqLab)
The Sequence Labeling (SeqLab) paradigm (also referred to as Sequence Tagging) is a fundamental paradigm modeling a variety of tasks such as part-of-speech (POS) tagging, named entity recognition (NER), and text chunking. Conventional neural sequence labeling models are comprised of an encoder to capture the contextualized feature for each token in the sequence, and a decoder to take in the features and predict the labels, i.e. y_1, · · · , y_n = DEC(ENC(X)), where the label sequence y_1, · · · , y_n has the same length as the input X.

MRC
The Machine Reading Comprehension (MRC) paradigm extracts contiguous token sequences (spans) from the input sequence conditioned on a given question. It was initially adopted to solve the MRC task, and was then generalized to other NLP tasks by reformulating them into the MRC format. To keep consistent with prior work and avoid confusion, we name this paradigm MRC and distinguish it from the task MRC. The MRC paradigm can be formally described as y_k, · · · , y_{k+l} = DEC(ENC(X_p, X_q)), where X_p and X_q denote the passage (also referred to as the context) and the query, and y_k, · · · , y_{k+l} is a span from X_p or X_q. Typically, DEC is implemented as two classifiers, one for predicting the starting position and one for predicting the ending position (Xiong et al., 2017; Chen et al., 2017a).
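The decoding step that combines the two classifiers can be sketched as follows; this is a minimal illustration (the toy scores and length limit are not tied to any particular model) of picking the span maximizing the summed start and end scores:

```python
def best_span(start_scores, end_scores, max_len=10):
    """Pick the (start, end) pair maximizing start_scores[i] + end_scores[j],
    subject to i <= j < i + max_len, as in typical MRC span decoders."""
    best, best_score = None, float("-inf")
    for i, s in enumerate(start_scores):
        for j in range(i, min(i + max_len, len(end_scores))):
            score = s + end_scores[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best

# toy scores over a 5-token passage
print(best_span([0.1, 2.0, 0.3, 0.1, 0.0], [0.0, 0.1, 1.5, 0.2, 0.1]))  # (1, 2)
```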

Sequence-to-Sequence (Seq2Seq)
The Sequence-to-Sequence (Seq2Seq) paradigm is a general and powerful paradigm that can handle a variety of NLP tasks. Typical applications of Seq2Seq include machine translation and dialogue, where the system is supposed to output a sequence (target language or response) conditioned on an input sequence (source language or user query). The Seq2Seq paradigm is typically implemented by an encoder-decoder framework (Sutskever et al., 2014; Bahdanau et al., 2015; Luong et al., 2016; Gehring et al., 2017): Y = DEC(ENC(X)). Different from SeqLab, the lengths of the input and output are not necessarily the same. Moreover, the decoder in Seq2Seq is usually more complicated and takes as input at each step the previous output (when testing) or the ground truth (with teacher forcing when training).

Sequence-to-Action-Sequence (Seq2ASeq)
Sequence-to-Action-Sequence (Seq2ASeq) is a widely used paradigm for structured prediction. The aim of Seq2ASeq is to predict an action sequence (also called a transition sequence) from some initial configuration c_0 to a terminal configuration. The predicted action sequence should encode some legal structure such as a dependency tree. The instances of the Seq2ASeq paradigm are usually called transition-based models, which can be formulated as a_t = CLS(ENC(X), c_{t−1}), where A = a_1, · · · , a_m is a sequence of actions and C = c_0, · · · , c_{m−1} is a sequence of configurations. At each time step, the model predicts an action a_t based on the input text and the current configuration c_{t−1}, which can be comprised of the top elements of the stack and buffer and the previous actions (Chen and Manning, 2014; Dyer et al., 2015).
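A minimal sketch of how configurations evolve under an action sequence, using arc-standard-style actions (the action names and configuration layout are illustrative, not those of any specific parser):

```python
def apply_action(config, action):
    """Apply one transition to a configuration (stack, buffer, arcs).

    Arc-standard-style sketch:
      SHIFT      move the front of the buffer onto the stack
      LEFT_ARC   add arc stack[-1] -> stack[-2], pop stack[-2]
      RIGHT_ARC  add arc stack[-2] -> stack[-1], pop stack[-1]
    """
    stack, buffer, arcs = config
    if action == "SHIFT":
        return stack + [buffer[0]], buffer[1:], arcs
    if action == "LEFT_ARC":
        return stack[:-2] + [stack[-1]], buffer, arcs + [(stack[-1], stack[-2])]
    if action == "RIGHT_ARC":
        return stack[:-1], buffer, arcs + [(stack[-2], stack[-1])]
    raise ValueError(action)

config = ([], ["economic", "news", "affects", "markets"], [])
for a in ["SHIFT", "SHIFT", "LEFT_ARC", "SHIFT", "LEFT_ARC", "SHIFT", "RIGHT_ARC"]:
    config = apply_action(config, a)
print(config[2])  # the dependency arcs encoded by the action sequence
```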

(M)LM
Language Modeling (LM) is a long-standing task in NLP, which is to estimate the probability of a given sequence of words occurring in a sentence. Due to its self-supervised fashion, language modeling and its variants, e.g. masked language modeling (MLM), are adopted as training objectives to pre-train models on large-scale unlabeled corpora. Typically, a language model can be simply formulated as P(x) = ∏_t P(x_t | x_{<t}), where the decoder DEC estimating each conditional P(x_t | x_{<t}) can be any auto-regressive model such as recurrent networks (Bengio et al., 2000; Grave et al., 2017) or the Transformer decoder (Dai et al., 2019). As a famous variant of LM, MLM can be formulated as x̂ = DEC(ENC(x̃)), where x̃ is a corrupted version of x obtained by replacing a portion of tokens with a special token [MASK], and x̂ denotes the masked tokens to be predicted. DEC can be implemented as a simple classifier as in BERT (Devlin et al., 2019) or an auto-regressive Transformer decoder as in BART (Lewis et al., 2020) and T5 (Raffel et al., 2020). Though LM and MLM are somewhat different (LM is auto-regressive while MLM is auto-encoding), we categorize them into one paradigm, (M)LM, due to their shared inherent nature: estimating the probability of some words given the context.
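The auto-regressive factorization above can be illustrated with a toy bigram model, where each conditional probability is simply looked up in a table (the probabilities are made-up values):

```python
import math

def sequence_log_prob(tokens, bigram_probs):
    """log P(x) = sum_t log P(x_t | x_{t-1}) under a toy bigram LM, the
    simplest instance of the auto-regressive factorization."""
    logp = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        logp += math.log(bigram_probs[(prev, cur)])
    return logp

bigrams = {("<s>", "the"): 0.5, ("the", "cat"): 0.2, ("cat", "sat"): 0.4}
print(sequence_log_prob(["<s>", "the", "cat", "sat"], bigrams))  # log(0.04)
```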

Compound Paradigm
In this paper, we mainly focus on fundamental paradigms (as described above) and tasks. Nevertheless, it is worth noting that more complicated NLP tasks can be solved by combining multiple fundamental paradigms. For instance, HotpotQA (Yang et al., 2018b), a multi-hop question answering task, can be solved by combining Matching and MRC, where Matching is responsible for finding relevant documents and MRC is responsible for selecting the answer span.

Paradigm Shift in NLP Tasks
In this section, we review the paradigm shifts that occur in different NLP tasks: Text Classification, Natural Language Inference, Named Entity Recognition, Aspect-Based Sentiment Analysis, Relation Extraction, Text Summarization, and Parsing.

Text Classification
Text classification is an essential task in various NLP applications. Conventional text classification tasks can be well solved by the Class paradigm. Nevertheless, its variants such as multi-label classification can be challenging, in which case Class may be sub-optimal. To that end, Yang et al. (2018a) propose to adopt the Seq2Seq paradigm to better capture interactions between the labels for multi-label classification tasks.
In addition, the semantics hidden in the labels cannot be fully exploited in the Class paradigm. Chai et al. (2020), among others, adopt the Matching paradigm to predict whether the pair-wise input (X, L_y) is matched, where X is the original text and L_y is the label description for class y. Though the semantic meaning of a label can be exactly defined by the samples belonging to it, incorporating prior knowledge of the label is also helpful when training data is limited.
With the rise of pre-trained language models (LMs), text classification tasks can also be solved in the (M)LM paradigm (Brown et al., 2020; Schick and Schütze, 2021a,b). By reformulating a text classification task into a (masked) language modeling task, the gap between LM pre-training and fine-tuning is narrowed, resulting in improved performance when training data is limited.

Natural Language Inference
Natural Language Inference (NLI) is typically modeled in the Matching paradigm, where the two input texts (X_a, X_b) are encoded and interact with each other, followed by a classifier to predict the relationship between them (Chen et al., 2017b). With the emergence of powerful encoders such as BERT (Devlin et al., 2019), NLI tasks can simply be solved in the Class paradigm by concatenating the two texts into one. In the case of few-shot learning, NLI tasks can also be formulated in the (M)LM paradigm by modifying the input, e.g. "X_a ? [MASK] , X_b". The unfilled token [MASK] can be predicted by the MLM head as Yes/No/Maybe, corresponding to Entailment/Contradiction/Neutral (Schick and Schütze, 2021a,b).
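The reformulation above can be sketched as a simple input transformation plus a verbalizer; the template and verbalizer words follow the example in the text:

```python
def nli_to_mlm(premise, hypothesis):
    """Recast an NLI pair as a cloze input using the template
    "X_a ? [MASK] , X_b" described above."""
    return f"{premise} ? [MASK] , {hypothesis}"

# verbalizer: map the word predicted at [MASK] back to an NLI label
VERBALIZER = {"Yes": "Entailment", "No": "Contradiction", "Maybe": "Neutral"}

prompt = nli_to_mlm("A man is sleeping", "A person is asleep")
print(prompt)
print(VERBALIZER["Yes"])  # if the MLM fills the slot with "Yes"
```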

Named Entity Recognition
Named Entity Recognition (NER) is also a fundamental task in NLP. NER can be categorized into three subtasks: flat NER, nested NER, and discontinuous NER. Traditional methods usually solve the three NER subtasks with three paradigms respectively, i.e. SeqLab (Ma and Hovy, 2016; Lample et al., 2016), Class (Xia et al., 2019; Fisher and Vlachos, 2019), and Seq2ASeq (Lample et al., 2016). Yu et al. (2020) and Fu et al. (2021) solve flat NER and nested NER with the Class paradigm. The main idea is to predict the label for each span in the input text. This paradigm shift introduces the span overlapping problem: the predicted entities may overlap, which is not allowed in flat NER. To handle this, Fu et al. (2021) adopt a heuristic decoding method: for overlapped spans, only the span with the highest prediction probability is kept. Another line of work proposes to formulate flat NER and nested NER as an MRC task, reconstructing each sample into a triplet (X, Q_y, X_span), where X is the original text, Q_y is the question for entity type y, and X_span is the answer. Given context, question, and answer, the MRC paradigm can be adopted to solve the task. Since there can be multiple answers (entities) in a sentence, an index matching module is developed to align the start and end indexes. Yan et al. (2021b) use a unified model based on the Seq2Seq paradigm to solve all three NER subtasks. The input of the Seq2Seq paradigm is the original text, while the output is a sequence of span-entity pairs, for instance, "Barack Obama <Person> US <Location>". Due to the versatility of the Seq2Seq paradigm and the great power of BART (Lewis et al., 2020), this unified model achieved state-of-the-art performance on various datasets spanning all three NER subtasks.
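The heuristic decoding method for the span overlapping problem can be sketched as a greedy procedure (the span tuple format and scores here are illustrative):

```python
def decode_flat_spans(scored_spans):
    """Greedy heuristic decoding for flat NER with span classification:
    among overlapping predicted spans, keep only the one with the highest
    prediction probability. Spans are (start, end, prob, label) with
    half-open [start, end) offsets."""
    kept = []
    for start, end, prob, label in sorted(scored_spans, key=lambda s: -s[2]):
        # keep the span only if it does not overlap any higher-scored span
        if all(end <= ks or start >= ke for ks, ke, _, _ in kept):
            kept.append((start, end, prob, label))
    return sorted(kept)

spans = [(0, 2, 0.9, "PER"), (1, 3, 0.6, "ORG"), (4, 5, 0.8, "LOC")]
print(decode_flat_spans(spans))  # the ORG span overlaps PER and is dropped
```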

Aspect-Based Sentiment Analysis
Aspect-Based Sentiment Analysis (ABSA) is a finegrained sentiment analysis task with seven subtasks, i.e., Aspect Term Extraction (AE), Opinion Term Extraction (OE), Aspect-Level Sentiment Classification (ALSC), Aspect-oriented Opinion Extraction (AOE), Aspect Term Extraction and Sentiment Classification (AESC), Pair Extraction (Pair), and Triplet Extraction (Triplet). These subtasks can be solved by different paradigms. For example, ALSC can be solved by the Class paradigm, and AESC can be solved by the SeqLab paradigm.
ALSC is to predict the sentiment polarity for each target-aspect pair, e.g. (LOC1, price), given a context, e.g. "LOC1 is often considered the coolest area of London". Sun et al. (2019) formulate such a classification task into a sentence-pair matching task and adopt the Matching paradigm to solve it. In particular, they generate auxiliary sentences (denoted as S_aux) for each target-aspect pair. For example, S_aux for (LOC1, price) can be "What do you think of the price of LOC1?". The auxiliary sentence is then concatenated with the context as (S_aux, X), which is fed into BERT (Devlin et al., 2019) to predict the sentiment. Another line of work adopts the MRC paradigm to handle all of the ABSA subtasks. In particular, two queries are constructed to sequentially extract the aspect terms and their corresponding polarities and opinion terms. The first query is "Find the aspect terms in the text." Assume the answer (aspect term) predicted by the MRC model is AT; then the second query can be constructed as "Find the sentiment polarity and opinion terms for AT in the text." Through such dataset conversion, all ABSA subtasks can be solved in the MRC paradigm. Yan et al. (2021a) solve all the ABSA subtasks with the Seq2Seq paradigm by converting the original label of a subtask into a sequence of tokens, which is used as the target to train a seq2seq model. Take the Triplet Extraction subtask as an example: for an input sentence, "The drinks are always well made and wine selection is fairly priced", the output target is constructed as "drinks well made Positive wine selection fairly priced Positive". Equipped with BART (Lewis et al., 2020) as the backbone, they achieved competitive performance on most ABSA subtasks.
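The label linearization used in the Seq2Seq approach can be sketched as follows; the exact target formatting is illustrative, following the Triplet Extraction example above:

```python
def linearize_triplets(triplets):
    """Build the Seq2Seq target for ABSA Triplet Extraction by
    concatenating (aspect, opinion, polarity) triplets into one token
    sequence, in the spirit of the target sequence shown above."""
    return " ".join(f"{aspect} {opinion} {polarity}"
                    for aspect, opinion, polarity in triplets)

target = linearize_triplets([
    ("drinks", "well made", "Positive"),
    ("wine selection", "fairly priced", "Positive"),
])
print(target)
```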
Very recently, the ABSA subtasks have also been formulated in the (M)LM paradigm. In particular, for the input text X and the aspect A and opinion O of interest, a consistency prompt and a polarity prompt are constructed, where the masked slot of the polarity prompt can be filled with sentiment polarity words.

Relation Extraction
Relation Extraction (RE) has two main subtasks: Relation Prediction (predicting the relationship r of two given entities s and o conditioned on their context) and Triplet Extraction (extracting the triplet (s, r, o) from the input text). The former subtask is mainly solved with the Class paradigm (Zeng et al., 2014), while the latter subtask is often solved in a pipeline style that first uses the SeqLab paradigm to extract the entities and then uses the Class paradigm to predict the relationship between the entities. Recent years have seen paradigm shift in relation extraction, especially in triplet extraction. Zeng et al. (2018) solve the triplet extraction task with the Seq2Seq paradigm. In their framework, the input of the Seq2Seq paradigm is the original text, while the output is a sequence of triplets {(r_1, s_1, o_1), · · · , (r_n, s_n, o_n)}. The copy mechanism (Gu et al., 2016) is adopted to extract entities in the text. Levy et al. (2017) address the RE task via the MRC paradigm by generating relation-specific questions. For instance, for the relation educated-at(s, o), a question such as "Where did s graduate from?" can be crafted to query an MRC model. Moreover, they demonstrate that formulating the RE task with MRC has the potential of zero-shot generalization to unseen relation types. Further, Zhao et al. (2020), among others, formulate the triplet extraction task as multi-turn question answering and solve it with the MRC paradigm, extracting entities and relations from the text by progressively asking the MRC model different questions.
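Turning relations into MRC queries amounts to simple template filling; a minimal sketch (the template table and relation names are hypothetical, in the spirit of Levy et al. (2017)):

```python
# Illustrative relation-to-question templates; the wording is hypothetical.
TEMPLATES = {
    "educated_at": "Where did {s} graduate from?",
    "born_in": "Where was {s} born?",
}

def relation_question(relation, subject):
    """Turn a (relation, subject) pair into an MRC query; the answer span
    predicted by the MRC model becomes the object o of the triplet."""
    return TEMPLATES[relation].format(s=subject)

print(relation_question("educated_at", "Marie Curie"))
```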
Very recently, the RE task has also been formulated as an MLM task by using logic rules to construct prompts with multiple sub-prompts. By encoding prior knowledge of entities and relations into prompts, the resulting model, PTR, achieved state-of-the-art performance on multiple RE datasets.

Text Summarization
Text summarization aims to generate a concise and informative summary of a long text. There are two different approaches to the text summarization task: extractive summarization and abstractive summarization. Extractive summarization approaches extract clauses of the original text to form the final summary, which usually lies in the SeqLab paradigm. In contrast, abstractive summarization approaches usually adopt the Seq2Seq paradigm to directly generate a summary conditioned on the original text. McCann et al. (2018) reformulate the summarization task as a question answering task, where the question is "What is the summary?". Since the answer (i.e. the summary) is not necessarily comprised of tokens in the original text, traditional MRC models cannot handle this. Therefore, the authors developed a seq2seq model to solve the summarization task in such a format. Zhong et al. (2020) propose to solve the extractive summarization task in the Matching paradigm instead of the SeqLab paradigm. The main idea is to match the semantics of the original text and each candidate summary, finding the summary with the highest matching score. Compared with traditional methods of extracting sentences individually, the matching framework enables the summary extractor to work at the summary level rather than the sentence level. Aghajanyan et al. (2021) formulate the text summarization task in the (M)LM paradigm. They pre-train a BART-style model directly on large-scale structured HTML web pages. Due to the rich semantics encoded in the HTML keywords, their pre-trained model is able to perform zero-shot text summarization by predicting the <title> element given the <body> of the document.
(Figure 2: we only list the first work for each paradigm shift.)

Parsing
Parsing (constituency parsing, dependency parsing, semantic parsing, etc.) plays a crucial role in many NLP applications such as machine translation and question answering. This family of tasks is to derive a structured syntactic or semantic representation from a natural language utterance. Two commonly used approaches for parsing are transition-based methods and graph-based methods. Typically, transition-based methods lie in the Seq2ASeq paradigm, and graph-based methods lie in the Class paradigm. By linearizing the target tree-structure to a sequence, parsing can be solved in the Seq2Seq paradigm (Andreas et al., 2013;Vinyals et al., 2015;Rongali et al., 2020), the SeqLab paradigm (Gómez-Rodríguez and Vilares, 2018;Strzyz et al., 2019;Vilares and Gómez-Rodríguez, 2020;Vacareanu et al., 2020), and the (M)LM paradigm (Choe and Charniak, 2016). In addition, Gan et al. (2021) employ the MRC paradigm to extract the parent span given the original sentence as the context and the child span as the question, achieving state-of-the-art performance on dependency parsing tasks across various languages.

Trends of Paradigm Shift
To intuitively depict the trend of paradigm shifts, we draw a Sankey diagram in Figure 2. We track the development of the NLP tasks considered in this section, along with several additional common tasks such as event extraction. When a task is solved using a paradigm that is different from its original paradigm, some of the value of the original paradigm is transferred to the new paradigm. In particular, for each NLP task of interest, we collect published papers that solve this task from 2012 to 2021 and denote the paradigm used in 2012 as the original paradigm of the task. Then we track the paradigm shifts in all the tasks with the same original paradigm and count the number of tasks that observed paradigm shifts by 2021. For each paradigm, we denote by N the total number of tasks that branched out to new paradigms. Assuming that the initial value of each paradigm is 100, the transferred value for each out-branch is defined as 100/(N + 1). Therefore, each branch in Figure 2 indicates a task that shifted its paradigm. Table 2 lists the source data of Figure 2.
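The out-branch value assignment described above is simple arithmetic; as a sketch:

```python
def out_branch_value(n_shifted_tasks, initial=100):
    """Value carried by each out-branch in the Sankey diagram:
    initial / (N + 1), where N tasks branched out of the paradigm."""
    return initial / (n_shifted_tasks + 1)

# e.g. a paradigm from which 3 tasks shifted away: each branch carries 25
print(out_branch_value(3))   # 25.0
print(out_branch_value(0))   # 100.0: no shift, the paradigm keeps its value
```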
As shown in Figure 2, we find that: (1) The frequency of paradigm shift has increased in recent years, especially after the emergence of pre-trained language models (PTMs). To fully utilize the power of these PTMs, a better way is to reformulate various NLP tasks into the paradigms that PTMs are good at. (2) More and more NLP tasks have shifted from traditional paradigms such as Class, SeqLab, and Seq2ASeq to paradigms that are more general and flexible, i.e., (M)LM, Matching, MRC, and Seq2Seq, which will be discussed in the following section.

Potential Unified Paradigms in NLP
Some of the paradigms have demonstrated the potential to formulate various NLP tasks in a unified framework. Instead of solving each task separately, such paradigms provide the possibility that a single deployed model can serve as a unified solver for diverse NLP tasks. The advantages of a single unified model over multiple task-specific models can be summarized as follows:
• Data efficiency. Training task-specific models usually requires large-scale task-specific labeled data. In contrast, a unified model has shown its ability to achieve considerable performance with much less labeled data.
• Generalization. Task-specific models are hard to transfer to new tasks, while a unified model can generalize to unseen tasks by formulating them into proper formats.
• Convenience. The unified models are easier and cheaper to deploy and serve, making them favorable as commercial black-box APIs.
In this section, we discuss the following general paradigms that have the potential to unify diverse NLP tasks: (M)LM, Matching, MRC, and Seq2Seq.

(M)LM
Reformulating downstream tasks into a (M)LM task is a natural way to utilize pre-trained LMs. The original input is modified with a predefined or learned prompt with some unfilled slots, which can be filled by the pre-trained LM. The task labels can then be derived from the filled tokens. For instance, a movie review "I love this movie" can be modified by appending a prompt as "I love this movie. It was [MASK]", in which [MASK] may be predicted as "fantastic" by the LM. The word "fantastic" can then be mapped to the label "positive" by a verbalizer. Solving downstream tasks in the (M)LM paradigm is also referred to as prompt-based learning. By fully utilizing the pre-trained parameters of the MLM head instead of training a classification head from scratch, prompt-based learning has demonstrated great power in few-shot and even zero-shot settings (Scao and Rush, 2021).
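The movie-review example above can be sketched as a prompt function plus a verbalizer (the verbalizer entries are illustrative):

```python
def add_prompt(review):
    """Append the cloze template from the example above to the input."""
    return review + " It was [MASK]."

# verbalizer: predicted word -> task label (the word choice is illustrative)
VERBALIZER = {"fantastic": "positive", "terrible": "negative"}

x = add_prompt("I love this movie.")
print(x)                        # "I love this movie. It was [MASK]."
print(VERBALIZER["fantastic"])  # "positive", if the LM predicts "fantastic"
```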
Prompt. The choice of prompt is critical to the performance of a particular task. A good prompt can be (1) Manually designed. Brown et al. (2020) and Schick and Schütze (2021a,b) manually craft task-specific prompts for different tasks. Though heuristic and sometimes non-intuitive, hand-crafted prompts have already achieved competitive performance on various few-shot tasks. (2) Generated by another pre-trained language model, e.g. T5 (Raffel et al., 2020), which is pre-trained to fill in missing spans in the input. (3) Learned by gradient descent. Shin et al. (2020) automatically construct prompts based on gradient-guided search. If the prompt need not be discrete, it can be optimized efficiently in continuous space. Recent works (Li and Liang, 2021; Qin and Eisner, 2021; Hambardzumyan et al., 2021; Liu et al., 2021b; Zhong et al., 2021) have shown that continuous prompts can also achieve competitive or even better performance.
Verbalizer. The design of the verbalizer also has a strong influence on the performance of prompt-based learning. A verbalizer can be manually designed: Schick and Schütze (2021a) heuristically designed verbalizers for different tasks and achieved competitive results. However, it is not always intuitive to manually design proper verbalizers for many tasks (e.g., when class labels do not directly correspond to words in the vocabulary).
Parameter-Efficient Tuning Compared with fine-tuning, where all model parameters need to be tuned for each task, prompt-based tuning is also favorable for its parameter efficiency. A recent study (Lester et al., 2021) has demonstrated that tuning only the prompt parameters while keeping the backbone model parameters fixed can achieve performance comparable to standard fine-tuning when models exceed billions of parameters. Due to this parameter efficiency, prompt-based tuning is a promising technique for the deployment of large-scale pre-trained LMs. In traditional fine-tuning, the server has to maintain a task-specific copy of the entire pre-trained LM for each downstream task, and inference has to be performed in separate batches. In prompt-based tuning, only a single pre-trained LM is required, and different tasks can be performed by modifying the inputs with task-specific prompts. Besides, inputs of different tasks can be mixed in the same batch, which makes the service highly efficient.

Matching
Another potential unified paradigm is Matching, or more specifically textual entailment (a.k.a. natural language inference). Textual entailment is the task of predicting, given two sentences called the premise and the hypothesis, whether the premise entails the hypothesis, contradicts it, or neither. Almost all text classification tasks can be reformulated as textual entailment (Dagan et al., 2005; Poliak et al., 2018; Yin et al., 2020). For example, a labeled movie review {x: I love this movie, y: positive} can be modified as {x: I love this movie [SEP] This is a great movie, y: entailment}. Similar to pre-trained LMs, entailment models are also widely accessible. Such universal entailment models can be pre-trained LMs that are fine-tuned on large-scale annotated entailment datasets such as MNLI (Williams et al., 2018). In addition to obtaining the entailment model in a supervised fashion, it has been shown that the next sentence prediction head of BERT, without training on any supervised entailment data, can also achieve competitive performance on various zero-shot tasks.
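The classification-as-entailment reformulation can be sketched as follows, with a toy word-overlap scorer standing in for an entailment model fine-tuned on MNLI (the label descriptions and scorer are illustrative):

```python
def classify_by_entailment(text, label_descriptions, entail_prob):
    """Pair the text with each label description and pick the label whose
    hypothesis is entailed with the highest probability; entail_prob is any
    function scoring P(entailment | premise, hypothesis)."""
    return max(label_descriptions,
               key=lambda y: entail_prob(text, label_descriptions[y]))

descriptions = {"positive": "This is a great movie",
                "negative": "This is a terrible movie"}

# toy entailment scorer based on word overlap, standing in for a real model
def toy_entail_prob(premise, hypothesis):
    p, h = set(premise.lower().split()), set(hypothesis.lower().split())
    return len(p & h) / len(h)

print(classify_by_entailment("I love this great movie", descriptions,
                             toy_entail_prob))  # "positive"
```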
Domain Adaptation The entailment model may be biased to the source domain, resulting in poor generalization to target domains. To mitigate the domain difference between the source task and the target task, Yin et al. (2020) propose the cross-task nearest neighbor module that matches instance representations and class representations in the source domain and the target domain, such that the entailment model can generalize well to new NLP tasks with limited annotations.
Label Descriptions For single-sentence classification tasks, label descriptions for each class are concatenated with the input text and predicted by the entailment model. Label descriptions can be regarded as a kind of prompt to trigger the entailment model. Hand-crafted label descriptions with minimal domain knowledge have been shown to achieve state-of-the-art performance on various few-shot tasks. Nevertheless, human-written label descriptions can be sub-optimal; Chai et al. (2020) therefore utilize reinforcement learning to generate label descriptions.
Comparison with Prompt-Based Learning In both paradigms ((M)LM and Matching), the goal is to reformulate the downstream task into the pre-training task (language modeling or entailment). To achieve this, both of them modify the input text with some template to prompt the pre-trained language or entailment model. In prompt-based learning, the prediction is conducted by the pre-trained MLM head on the [MASK] token, while in matching-based learning the prediction is conducted by the pre-trained classifier on the [CLS] token. In prompt-based learning, the output prediction is over the vocabulary, such that a verbalizer is required to map the predicted word into a task label. In contrast, matching-based learning can simply reuse the output (Entailment/Contradiction/Neutral, or Entailment/NotEntailment). Another benefit of matching-based learning is that one can construct pairwise augmented data to perform contrastive learning, achieving further improvement in few-shot performance. However, matching-based learning requires large-scale human-annotated entailment data to pre-train an entailment model, and the domain difference between the source domain and the target domain needs to be handled. Besides, matching-based learning can only be used for understanding tasks, while prompt-based learning can also be used for generation (Li and Liang, 2021; Liu et al., 2021b).

MRC
MRC is also an alternative paradigm to unify various NLP tasks, by generating task-specific questions and training an MRC model to select the correct span from the input text conditioned on the questions. Take NER as an example: one can recognize the organization entities in the input "Google was founded in 1998" by querying an MRC model with "Google was founded in 1998. Find organizations in the text, including companies, agencies and institutions". In addition to NER, the MRC framework has also demonstrated competitive performance in entity-relation extraction, coreference resolution, entity linking, dependency parsing (Gan et al., 2021), dialog state tracking (Gao et al., 2019), event extraction (Du and Cardie, 2020), aspect-based sentiment analysis, etc. The MRC paradigm can be applied as long as the task input can be reformulated as context, question, and answer. Due to its universality, McCann et al. (2018) proposed decaNLP to unify ten NLP tasks, including question answering, machine translation, summarization, natural language inference, sentiment analysis, semantic role labeling, relation extraction, goal-oriented dialogue, semantic parsing, and commonsense pronoun resolution, in a unified QA format. Different from the previously mentioned works, the answer may not appear in the context or question for some tasks of decaNLP, such as semantic parsing, so the framework is not strictly an MRC paradigm.
Comparison with Prompt-Based Learning It is worth noting that the designed question is analogous to the prompt in (M)LM. A verbalizer is not necessary in MRC since the answer is a span in the context or question. The predictor, the MLM head in prompt-based learning, can be replaced by a start/end classifier as in traditional MRC models, or a pointer network as in McCann et al. (2018).

Seq2Seq
Seq2Seq is a general and flexible paradigm that can handle any task whose input and output can be recast as sequences of tokens. Early work (McCann et al., 2018) explored using the Seq2Seq paradigm to simultaneously solve different classes of tasks. Powered by recent advances in seq2seq pre-training such as MASS (Song et al., 2019), T5 (Raffel et al., 2020), and BART (Lewis et al., 2020), the Seq2Seq paradigm has shown great potential for unifying diverse NLP tasks. Paolini et al. (2021) use T5 (Raffel et al., 2020) to solve many structured prediction tasks including joint entity and relation extraction, nested NER, relation classification, semantic role labeling, event extraction, coreference resolution, and dialogue state tracking. Yan et al. (2021a) and Yan et al. (2021b) use BART (Lewis et al., 2020), equipped with the copy network (Gu et al., 2016), to unify all NER tasks (flat NER, nested NER, discontinuous NER) and all ABSA tasks (AE, OE, ALSC, AOE, AESC, Pair, Triplet), respectively.
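A minimal sketch of how structured NER output can be recast as a token sequence for a Seq2Seq model (the target format here is invented purely for illustration; the cited works use their own serializations, e.g. pointer-based generation in Yan et al. (2021a)):

```python
# Toy sketch: linearizing structured NER predictions into a flat target
# sequence, and parsing the generated sequence back into structure.

def linearize(entities: list[tuple[str, str]]) -> str:
    """Serialize (span, type) pairs into a flat target sequence."""
    return " ; ".join(f"{span} <{etype}>" for span, etype in entities)

def delinearize(target: str) -> list[tuple[str, str]]:
    """Parse a generated sequence back into structured predictions."""
    entities = []
    for chunk in target.split(" ; "):
        span, _, etype = chunk.rpartition(" ")
        entities.append((span, etype.strip("<>")))
    return entities

gold = [("Google", "ORG"), ("1998", "YEAR")]
target = linearize(gold)          # "Google <ORG> ; 1998 <YEAR>"
assert delinearize(target) == gold  # the round trip recovers the structure
```

The round-trip property is what makes the paradigm flexible: any task whose output admits such a reversible serialization can, in principle, be handled by one seq2seq model.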
Comparison with Other Paradigms Compared with other unified paradigms, Seq2Seq is particularly suited to complicated tasks such as structured prediction. Another benefit is that Seq2Seq is also compatible with other paradigms such as (M)LM (Raffel et al., 2020;Lewis et al., 2020), MRC (McCann et al., 2018), etc. Nevertheless, what comes with this versatility is high latency. Currently, most successful seq2seq models are auto-regressive, where each generation step depends on the previously generated tokens. This sequential nature results in inherent latency at inference time. Therefore, more work is needed to develop efficient seq2seq models through non-autoregressive methods (Gu et al., 2018), early exiting (Elbayad et al., 2020), or other alternative techniques.

Conclusion
Recently, prompt-based tuning, which reformulates an NLP task as a (M)LM task, has exploded in popularity. These methods can achieve considerable performance with much less training data. In contrast, the other potential unified paradigms, i.e. Matching, MRC, and Seq2Seq, are underexplored in the context of pre-training. One of the main reasons is that these paradigms require large-scale annotated data to conduct pre-training, and Seq2Seq in particular is notoriously data-hungry.
Nevertheless, these paradigms have their own advantages over (M)LM: Matching requires less engineering, MRC is more interpretable, and Seq2Seq is more flexible for handling complicated tasks. Besides, by combining with self-supervised pre-training (e.g. BART (Lewis et al., 2020) and T5 (Raffel et al., 2020)), or by further pre-training on annotated data with an existing language model as initialization, these paradigms can achieve performance competitive with, or even better than, (M)LM. Therefore, we argue that more attention should be paid to exploring more powerful entailment, MRC, or seq2seq models through pre-training or other alternative techniques.