Pre-trained models for natural language processing: A survey

  • Review
  • Published in Science China Technological Sciences

Abstract

Recently, the emergence of pre-trained models (PTMs) has brought natural language processing (NLP) into a new era. In this survey, we provide a comprehensive review of PTMs for NLP. We first briefly introduce language representation learning and its research progress. We then systematically categorize existing PTMs according to a taxonomy built on four perspectives. Next, we describe how to adapt the knowledge of PTMs to downstream tasks. Finally, we outline some potential directions for future PTM research. This survey is intended to be a hands-on guide for understanding, using, and developing PTMs for various NLP tasks.
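
As a concrete complement to the adaptation step mentioned in the abstract, the sketch below shows one common way to fine-tune a general-purpose PTM on a downstream classification task. It is a minimal illustration only, assuming the Hugging Face transformers library, the bert-base-uncased checkpoint, and made-up sentiment labels; none of these choices are prescribed by the survey itself.

    # Minimal fine-tuning sketch (illustrative assumptions: checkpoint name,
    # label set, and hyper-parameters are not taken from the survey).
    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2  # adds a new task-specific classification head
    )

    texts = ["A clear and useful survey.", "The notation is hard to follow."]
    labels = torch.tensor([1, 0])  # hypothetical sentiment labels

    # Tokenize into the sub-word units the encoder was pre-trained on.
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
    model.train()
    for _ in range(3):  # a few gradient steps stand in for a full fine-tuning run
        loss = model(**batch, labels=labels).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

Feature-based use (freezing the encoder and feeding its contextual representations to a separate task-specific model) and the other adaptation strategies discussed in the survey start from the same loading step but differ in which parameters are updated.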

Author information

Corresponding author

Correspondence to XiPeng Qiu.

Additional information

This work was supported by the National Natural Science Foundation of China (Grant Nos. 61751201 and 61672162), the Shanghai Municipal Science and Technology Major Project (Grant No. 2018SHZDZX01) and ZJLab. We thank Zhiyuan Liu, Wanxiang Che, Minlie Huang, Danqing Wang and Luyao Huang for their valuable feedback on this manuscript.

About this article

Cite this article

Qiu, X., Sun, T., Xu, Y. et al. Pre-trained models for natural language processing: A survey. Sci. China Technol. Sci. 63, 1872–1897 (2020). https://doi.org/10.1007/s11431-020-1647-3
