Pre-trained models for natural language processing: A survey

  • Review
  • Published in Science China Technological Sciences

Abstract

Recently, the emergence of pre-trained models (PTMs) has brought natural language processing (NLP) into a new era. In this survey, we provide a comprehensive review of PTMs for NLP. We first briefly introduce language representation learning and its research progress. We then systematically categorize existing PTMs according to a taxonomy built on four perspectives. Next, we describe how to adapt the knowledge of PTMs to downstream tasks. Finally, we outline some potential directions for future PTM research. This survey is intended to be a hands-on guide for understanding, using, and developing PTMs for various NLP tasks.
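
As a concrete complement to the adaptation step mentioned in the abstract, the sketch below shows one common way to fine-tune a general-purpose PTM on a downstream classification task. It is a minimal illustration only, assuming the Hugging Face transformers library, the bert-base-uncased checkpoint, and made-up sentiment labels; none of these choices are prescribed by the survey itself.

    # Minimal fine-tuning sketch (illustrative assumptions: checkpoint name,
    # label set, and hyper-parameters are not taken from the survey).
    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2  # adds a new task-specific classification head
    )

    texts = ["A clear and useful survey.", "The notation is hard to follow."]
    labels = torch.tensor([1, 0])  # hypothetical sentiment labels

    # Tokenize into the sub-word units the encoder was pre-trained on.
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
    model.train()
    for _ in range(3):  # a few gradient steps stand in for a full fine-tuning run
        loss = model(**batch, labels=labels).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

Feature-based use (freezing the encoder and feeding its contextual representations to a separate task-specific model) and the other adaptation strategies discussed in the survey start from the same loading step but differ in which parameters are updated.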

Author information

Corresponding author

Correspondence to XiPeng Qiu.

Additional information

This work was supported by the National Natural Science Foundation of China (Grant Nos. 61751201 and 61672162), the Shanghai Municipal Science and Technology Major Project (Grant No. 2018SHZDZX01) and ZJLab. We thank Zhiyuan Liu, Wanxiang Che, Minlie Huang, Danqing Wang and Luyao Huang for their valuable feedback on this manuscript.

About this article

Cite this article

Qiu, X., Sun, T., Xu, Y. et al. Pre-trained models for natural language processing: A survey. Sci. China Technol. Sci. 63, 1872–1897 (2020). https://doi.org/10.1007/s11431-020-1647-3
