
Length-Based Curriculum Learning for Efficient Pre-training of Language Models


Abstract

Pre-trained language models (PLMs) have recently become core components in a wide range of natural language processing applications. However, PLMs such as BERT and RoBERTa are typically trained on large amounts of unlabeled text, which requires extremely high computational cost. Curriculum learning (CL), a training strategy that presents samples in order from easy to hard, has the potential to alleviate this problem. Nevertheless, how to define a difficulty measure for training samples and an effective training scheduler for PLMs remain open questions. In this study, we adopt the length of the input text as the difficulty measure and propose a new CL approach called length-based CL. We analyze the effectiveness of the length-based difficulty measure in terms of convergence speed and GLUE scores using a limited corpus. By combining the maximum available batch size with the length-based difficulty measure, we show that our length-based CL model converges 1.5 times faster in pre-training and performs better on downstream tasks. Furthermore, we expand the corpus to evaluate various pacing functions (training schedulers) for length-based CL with respect to computational time and generalization performance. In experiments with this larger corpus, our proposed Square scheduler requires less computational time in pre-training and obtains the best generalization performance on downstream tasks.
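To make the idea concrete, the sketch below shows one way a length-based curriculum with a square-shaped pacing function could be implemented in Python. It is an illustrative sketch only, not the authors' implementation: the token-count difficulty measure follows the description above, but the exact form of the Square pacing function, the starting fraction, and the helper names are assumptions made for this example.

# Illustrative sketch of length-based curriculum learning with a square
# pacing function. NOT the authors' code; the pacing formula and the
# starting fraction are assumptions for illustration.
from typing import List, Sequence


def length_difficulty(token_ids: Sequence[int]) -> int:
    """Difficulty of a training sample, measured as its length in tokens."""
    return len(token_ids)


def square_pacing(step: int, total_steps: int, start_fraction: float = 0.1) -> float:
    """Fraction of the easy-to-hard-sorted corpus made available at `step`.

    Assumed form: f(t) = start_fraction + (1 - start_fraction) * (t / T)^2,
    clipped to 1.0, so the available pool grows slowly at first and faster later.
    """
    progress = min(step / max(total_steps, 1), 1.0)
    return min(1.0, start_fraction + (1.0 - start_fraction) * progress ** 2)


def curriculum_pool(corpus: List[Sequence[int]], step: int, total_steps: int) -> List[Sequence[int]]:
    """Return the subset of the corpus (shortest samples first) visible at `step`."""
    sorted_corpus = sorted(corpus, key=length_difficulty)  # easy (short) -> hard (long)
    cutoff = max(1, int(square_pacing(step, total_steps) * len(sorted_corpus)))
    return sorted_corpus[:cutoff]


# Usage: at each pre-training step, draw a batch from
# curriculum_pool(corpus, step, total_steps) and feed it to the masked
# language modeling objective as usual.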


References

  1. Howard, J., Ruder, S.: Universal language model fine-tuning for text classification. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, vol. 1 (Long Papers), pp. 328–339 (2018)

  2. Dai, A.M., Le, Q.V.: Semi-supervised sequence learning. Adv. Neural Inf. Process. Syst. 28, 3079–3087 (2015)


  3. Peters, M.E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., Zettlemoyer, L.: Deep contextualized word representations. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, vol. 1 (Long Papers), pp. 2227–2237 (2018)

  4. Radford, A., Narasimhan, K., Salimans, T., Sutskever, I.: Improving language understanding by generative pre-training (2018)

  5. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, vol. 1 (Long and Short Papers), pp. 4171–4186 (2019)

  6. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: a robustly optimized BERT pretraining approach (2019). arXiv preprint. arXiv:1907.11692

  7. Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., Soricut, R.: ALBERT: a lite BERT for self-supervised learning of language representations (2020). arXiv:1909.11942 [cs]

  8. Sanh, V., Debut, L., Chaumond, J., Wolf, T.: DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter (2019). arXiv preprint. arXiv:1910.01108

  9. Clark, K., Luong, M.T., Le, Q.V., Manning, C.D.: ELECTRA: pre-training text encoders as discriminators rather than generators (2020). arXiv preprint. arXiv:2003.10555

  10. Taylor, W.L.: “Cloze procedure”: a new tool for measuring readability. Journal. Mass Commun. Q. 30, 415–433 (1953)


  11. Voita, E., Talbot, D., Moiseev, F., Sennrich, R., Titov, I.: Analyzing multi-head self-attention: specialized heads do the heavy lifting, the rest can be pruned. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 5797–5808 (2019)

  12. Sukhbaatar, S., Grave, E., Bojanowski, P., Joulin, A.: Adaptive attention span in transformers. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 331–335 (2019)

  13. de Wynter, A., Perry, D.J.: Optimal subarchitecture extraction for BERT (2020). CoRR. arXiv:2010.10499

  14. Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R.R., Le, Q.V.: XLNet: generalized autoregressive pretraining for language understanding. In: Advances in Neural Information Processing Systems, vol. 32 (2019)

  15. Elman, J.L.: Learning and development in neural networks: the importance of starting small. Cognition 48(1), 71–99 (1993)


  16. Bengio, Y., Louradour, J., Collobert, R., Weston, J.: Curriculum learning. In: Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41–48 (2009)

  17. Moore, R.C., Lewis, W.: Intelligent selection of language model training data. In: Proceedings of the ACL 2010 Conference Short Papers, pp. 220–224 (2010)

  18. Gururangan, S., Marasović, A., Swayamdipta, S., Lo, K., Beltagy, I., Downey, D., Smith, N.A.: Don’t stop pretraining: adapt language models to domains and tasks (2020). arXiv preprint. arXiv:2004.10964

  19. Soviany, P., Ionescu, R.T., Rota, P., Sebe, N.: Curriculum learning: a survey (2021). CoRR. arXiv:2101.10382

  20. Wang, X., Chen, Y., Zhu, W.: A survey on curriculum learning. IEEE Trans. Pattern Anal. Mach. Intell. 44(09), 4555–4576 (2022)


  21. Shi, M., Ferrari, V.: Weakly supervised object localization using size estimates. In: European Conference on Computer Vision, pp. 105–121 (2016)

  22. Ionescu, R.T., Alexe, B., Leordeanu, M., Popescu, M., Papadopoulos, D.P., Ferrari, V.: How hard can it be? Estimating the difficulty of visual search in an image. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2157–2166 (2016)

  23. Spitkovsky, V.I., Alshawi, H., Jurafsky, D.: From baby steps to leapfrog: how “less is more” in unsupervised dependency parsing. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 751–759 (2010)

  24. Nagatsuka, K., Broni-Bediako, C., Atsumi, M.: Pre-training a BERT with curriculum learning by increasing block-size of input text. In: Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021), pp. 989–996 (2021)

  25. Hacohen, G., Weinshall, D.: On the power of curriculum learning in training deep networks. In: Proceedings of the 36th International Conference on Machine Learning, vol. 97, pp. 2535–2544 (2019)

  26. Merity, S., Xiong, C., Bradbury, J., Socher, R.: Pointer sentinel mixture models (2016). arXiv preprint. arXiv:1609.07843

  27. Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., Amodei, D.: Language models are few-shot learners. In: Advances in Neural Information Processing Systems, vol. 33, pp. 1877–1901 (2020)

  28. Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., Liu, P.J.: Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 21(140), 1–67 (2020)


  29. Wu, X., Dyer, E., Neyshabur, B.: When do curricula work? In: International Conference on Learning Representations (2021)

  30. Li, C., Zhang, M., He, Y.: Curriculum learning: a regularization method for efficient and stable billion-scale GPT model pre-training (2021). arXiv:2108.06084

  31. Xu, B., Zhang, L., Mao, Z., Wang, Q., Xie, H., Zhang, Y.: Curriculum learning for natural language understanding. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 6095–6104 (2020)

  32. Kocmi, T., Bojar, O.: Curriculum learning and minibatch bucketing in neural machine translation. In: Proceedings of the International Conference Recent Advances in Natural Language Processing (RANLP 2017), pp. 379–386 (2017)

  33. Platanios, E.A., Stretcu, O., Neubig, G., Poczos, B., Mitchell, T.: Competence-based curriculum learning for neural machine translation. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, vol. 1 (Long and Short Papers), pp. 1162–1172 (2019)

  34. Rajeswar, S., Subramanian, S., Dutil, F., Pal, C., Courville, A.: Adversarial generation of natural language (2017). arXiv preprint. arXiv:1705.10929

  35. Tay, Y., Wang, S., Luu, A.T., Fu, J., Phan, M.C., Yuan, X., Rao, J., Hui, S.C., Zhang, A.: Simple and effective curriculum pointer-generator networks for reading comprehension over long narratives (2019). arXiv:1905.10847

  36. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al.: Language models are unsupervised multitask learners. OpenAI Blog 1(8), 9 (2019)


  37. Penha, G., Hauff, C.: Curriculum learning strategies for IR: an empirical study on conversation response ranking (2019). arXiv:1912.08555

  38. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: International Conference on Learning Representations (2019)

  39. Marcus, M.P., Santorini, B., Marcinkiewicz, M.A.: Building a large annotated corpus of English: the Penn Treebank. Comput. Linguist. 19(2), 313–330 (1993)


  40. Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., Bowman, S.: GLUE: a multi-task benchmark and analysis platform for natural language understanding. In: Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pp. 353–355 (2018)

  41. Zhang, X., Kumar, G., Khayrallah, H., Murray, K., Gwinnup, J., Martindale, M.J., McNamee, P., Duh, K., Carpuat, M.: An empirical exploration of curriculum learning for neural machine translation (2018). arXiv preprint. arXiv:1811.00739

  42. Zhang, X., Shapiro, P., Kumar, G., McNamee, P., Carpuat, M., Duh, K.: Curriculum learning for domain adaptation in neural machine translation. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, vol. 1 (Long and Short Papers), pp. 1903–1915 (2019)

  43. Rajpurkar, P., Jia, R., Liang, P.: Know what you don’t know: unanswerable questions for SQuAD. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, vol. 2 (Short Papers), pp. 784–789 (2018)

  44. Rajpurkar, P., Zhang, J., Lopyrev, K., Liang, P.: SQuAD: 100,000+ questions for machine comprehension of text. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 2383–2392 (2016)


Acknowledgements

This work was partially supported by the Japan Science and Technology Agency (JST) under the SPRING scholarship program.

Author information

Corresponding author

Correspondence to Koichi Nagatsuka.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article


Cite this article

Nagatsuka, K., Broni-Bediako, C. & Atsumi, M. Length-Based Curriculum Learning for Efficient Pre-training of Language Models. New Gener. Comput. 41, 109–134 (2023). https://doi.org/10.1007/s00354-022-00198-8

