
Length-Based Curriculum Learning for Efficient Pre-training of Language Models


Abstract

Pre-trained language models (PLMs) have recently become core components in a wide range of natural language processing applications. However, PLMs such as BERT and RoBERTa are typically trained on large amounts of unlabeled text, which requires extremely high computational cost. Curriculum learning (CL), a training strategy that presents samples in order from easy to hard, has the potential to alleviate this problem. Nevertheless, how to define a difficulty measure for training samples and an effective training scheduler for PLMs remain open questions. In this study, we adopt the length of the input text as the difficulty measure and propose a new CL approach called length-based CL. We analyze the effectiveness of the length-based difficulty measure in terms of convergence speed and GLUE scores using a limited corpus. By combining the maximum available batch size with the length-based difficulty measure, we show that our length-based CL model converges 1.5 times faster in pre-training and performs better on downstream tasks. Furthermore, we expand the corpus to evaluate various pacing functions (training schedulers) for length-based CL with respect to computational time and generalization performance. In experiments with this larger corpus, our proposed Square scheduler requires less computational time in pre-training and obtains the best generalization performance on downstream tasks.
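To make the idea concrete, the sketch below shows one way a length-based curriculum with a square-shaped pacing function could be implemented in Python. It is an illustrative sketch only, not the authors' implementation: the token-count difficulty measure follows the description above, but the exact form of the Square pacing function, the starting fraction, and the helper names are assumptions made for this example.

# Illustrative sketch of length-based curriculum learning with a square
# pacing function. NOT the authors' code; the pacing formula and the
# starting fraction are assumptions for illustration.
from typing import List, Sequence


def length_difficulty(token_ids: Sequence[int]) -> int:
    """Difficulty of a training sample, measured as its length in tokens."""
    return len(token_ids)


def square_pacing(step: int, total_steps: int, start_fraction: float = 0.1) -> float:
    """Fraction of the easy-to-hard-sorted corpus made available at `step`.

    Assumed form: f(t) = start_fraction + (1 - start_fraction) * (t / T)^2,
    clipped to 1.0, so the available pool grows slowly at first and faster later.
    """
    progress = min(step / max(total_steps, 1), 1.0)
    return min(1.0, start_fraction + (1.0 - start_fraction) * progress ** 2)


def curriculum_pool(corpus: List[Sequence[int]], step: int, total_steps: int) -> List[Sequence[int]]:
    """Return the subset of the corpus (shortest samples first) visible at `step`."""
    sorted_corpus = sorted(corpus, key=length_difficulty)  # easy (short) -> hard (long)
    cutoff = max(1, int(square_pacing(step, total_steps) * len(sorted_corpus)))
    return sorted_corpus[:cutoff]


# Usage: at each pre-training step, draw a batch from
# curriculum_pool(corpus, step, total_steps) and feed it to the masked
# language modeling objective as usual.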


References

  1. Howard, J., Ruder, S.: Universal language model fine-tuning for text classification. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, vol. 1 (Long Papers), pp. 328–339 (2018)

  2. Dai, A.M., Le, Q.V.: Semi-supervised sequence learning. Adv. Neural Inf. Process. Syst. 28, 3079–3087 (2015)


  3. Peters, M.E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., Zettlemoyer, L.: Deep contextualized word representations. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, vol. 1 (Long Papers), pp. 2227–2237 (2018)

  4. Radford, A., Narasimhan, K., Salimans, T., Sutskever, I.: Improving language understanding by generative pre-training (2018)

  5. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, vol. 1 (Long and Short Papers), pp. 4171–4186 (2019)

  6. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: a robustly optimized BERT pretraining approach (2019). arXiv preprint. arXiv:1907.11692

  7. Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., Soricut, R.: ALBERT: a lite BERT for self-supervised learning of language representations (2020). arXiv:1909.11942 [cs]

  8. Sanh, V., Debut, L., Chaumond, J., Wolf, T.: DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter (2019). arXiv preprint. arXiv:1910.01108

  9. Clark, K., Luong, M.T., Le, Q.V., Manning, C.D.: ELECTRA: pre-training text encoders as discriminators rather than generators (2020). arXiv preprint. arXiv:2003.10555

  10. Taylor, W.L.: “Cloze procedure”: a new tool for measuring readability. Journal. Mass Commun. Q. 30, 415–433 (1953)


  11. Voita, E., Talbot, D., Moiseev, F., Sennrich, R., Titov, I.: Analyzing multi-head self-attention: specialized heads do the heavy lifting, the rest can be pruned. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 5797–5808 (2019)

  12. Sukhbaatar, S., Grave, E., Bojanowski, P., Joulin, A.: Adaptive attention span in transformers. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 331–335 (2019)

  13. de Wynter, A., Perry, D.J.: Optimal subarchitecture extraction for BERT (2020). CoRR. arXiv:2010.10499

  14. Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R.R., Le, Q.V.: XLNet: generalized autoregressive pretraining for language understanding. In: Advances in Neural Information Processing Systems, vol. 32 (2019)

  15. Elman, J.L.: Learning and development in neural networks: the importance of starting small. Cognition 48(1), 71–99 (1993)


  16. Bengio, Y., Louradour, J., Collobert, R., Weston, J.: Curriculum learning. In: Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41–48 (2009)

  17. Moore, R.C., Lewis, W.: Intelligent selection of language model training data. In: Proceedings of the ACL 2010 Conference Short Papers, pp. 220–224 (2010)

  18. Gururangan, S., Marasović, A., Swayamdipta, S., Lo, K., Beltagy, I., Downey, D., Smith, N.A.: Don’t stop pretraining: adapt language models to domains and tasks (2020). arXiv preprint. arXiv:2004.10964

  19. Soviany, P., Ionescu, R.T., Rota, P., Sebe, N.: Curriculum learning: a survey (2021). CoRR. arXiv:2101.10382

  20. Wang, X., Chen, Y., Zhu, W.: A survey on curriculum learning. IEEE Trans. Pattern Anal. Mach. Intell. 44(09), 4555–4576 (2022)


  21. Shi, M., Ferrari, V.: Weakly supervised object localization using size estimates. In: European Conference on Computer Vision, pp. 105–121 (2016)

  22. Ionescu, R.T., Alexe, B., Leordeanu, M., Popescu, M., Papadopoulos, D.P., Ferrari, V.: How hard can it be? Estimating the difficulty of visual search in an image. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2157–2166 (2016)

  23. Spitkovsky, V.I., Alshawi, H., Jurafsky, D.: From baby steps to leapfrog: how “less is more” in unsupervised dependency parsing. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 751–759 (2010)

  24. Nagatsuka, K., Broni-Bediako, C., Atsumi, M.: Pre-training a BERT with curriculum learning by increasing block-size of input text. In: Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021), pp. 989–996 (2021)

  25. Hacohen, G., Weinshall, D.: On the power of curriculum learning in training deep networks. In: Proceedings of the 36th International Conference on Machine Learning, vol. 97, pp. 2535–2544 (2019)

  26. Merity, S., Xiong, C., Bradbury, J., Socher, R.: Pointer sentinel mixture models (2016). arXiv preprint. arXiv:1609.07843

  27. Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., Amodei, D.: Language models are few-shot learners. In: Advances in Neural Information Processing Systems, vol. 33, pp. 1877–1901 (2020)

  28. Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., Liu, P.J.: Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 21(140), 1–67 (2020)


  29. Wu, X., Dyer, E., Neyshabur, B.: When do curricula work? In: International Conference on Learning Representations (2021)

  30. Li, C., Zhang, M., He, Y.: Curriculum learning: a regularization method for efficient and stable billion-scale GPT model pre-training (2021). arXiv:2108.06084

  31. Xu, B., Zhang, L., Mao, Z., Wang, Q., Xie, H., Zhang, Y.: Curriculum learning for natural language understanding. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 6095–6104 (2020)

  32. Kocmi, T., Bojar, O.: Curriculum learning and minibatch bucketing in neural machine translation. In: Proceedings of the International Conference Recent Advances in Natural Language Processing (RANLP 2017), pp. 379–386 (2017)

  33. Platanios, E.A., Stretcu, O., Neubig, G., Poczos, B., Mitchell, T.: Competence-based curriculum learning for neural machine translation. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, vol. 1 (Long and Short Papers), pp. 1162–1172 (2019)

  34. Rajeswar, S., Subramanian, S., Dutil, F., Pal, C., Courville, A.: Adversarial generation of natural language (2017). arXiv preprint. arXiv:1705.10929

  35. Tay, Y., Wang, S., Luu, A.T., Fu, J., Phan, M.C., Yuan, X., Rao, J., Hui, S.C., Zhang, A.: Simple and effective curriculum pointer-generator networks for reading comprehension over long narratives (2019). arXiv:1905.10847

  36. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al.: Language models are unsupervised multitask learners. OpenAI Blog 1(8), 9 (2019)


  37. Penha, G., Hauff, C.: Curriculum learning strategies for IR: an empirical study on conversation response ranking (2019). arXiv:1912.08555

  38. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: International Conference on Learning Representations (2019)

  39. Marcus, M.P., Santorini, B., Marcinkiewicz, M.A.: Building a large annotated corpus of English: the Penn Treebank. Comput. Linguist. 19(2), 313–330 (1993)


  40. Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., Bowman, S.: GLUE: a multi-task benchmark and analysis platform for natural language understanding. In: Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pp. 353–355 (2018)

  41. Zhang, X., Kumar, G., Khayrallah, H., Murray, K., Gwinnup, J., Martindale, M.J., McNamee, P., Duh, K., Carpuat, M.: An empirical exploration of curriculum learning for neural machine translation (2018). arXiv preprint. arXiv:1811.00739

  42. Zhang, X., Shapiro, P., Kumar, G., McNamee, P., Carpuat, M., Duh, K.: Curriculum learning for domain adaptation in neural machine translation. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, vol. 1 (Long and Short Papers), pp. 1903–1915 (2019)

  43. Rajpurkar, P., Jia, R., Liang, P.: Know what you don’t know: unanswerable questions for SQuAD. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, vol. 2 (Short Papers), pp. 784–789 (2018)

  44. Rajpurkar, P., Zhang, J., Lopyrev, K., Liang, P.: SQuAD: 100,000+ questions for machine comprehension of text. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 2383–2392 (2016)


Acknowledgements

This work was partially supported by the Japan Science and Technology Agency (JST) under the SPRING scholarship program.

Author information

Corresponding author

Correspondence to Koichi Nagatsuka.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article


Cite this article

Nagatsuka, K., Broni-Bediako, C. & Atsumi, M. Length-Based Curriculum Learning for Efficient Pre-training of Language Models. New Gener. Comput. 41, 109–134 (2023). https://doi.org/10.1007/s00354-022-00198-8

