
Reminding the incremental language model via data-free self-distillation


Abstract

Incremental language learning, which involves retrieving pseudo-data from previous tasks, can alleviate catastrophic forgetting. However, previous methods require a large amount of pseudo-data to approach the performance of multitask learning, and performance decreases dramatically when there is significantly less pseudo-data than new-task data. This decrease occurs because the pseudo-data are learned inefficiently and deviate from the real data. To address these issues, we propose reminding the incremental language model via data-free self-distillation (DFSD), which includes 1) self-distillation based on the Earth mover’s distance (SD-EMD) and 2) hidden data augmentation (HDA). SD-EMD increases the efficiency of the model by adaptively estimating the knowledge distribution in all GPT-2 layers and effectively transferring knowledge from the teacher model to the student model via adaptive self-multilayer-to-multilayer mapping. HDA reduces deviations by decomposing the generation process via data augmentation and bootstrapping. Our experiments on decaNLP and text classification tasks with low pseudo-data sampling ratios reveal that DFSD outperforms previous state-of-the-art incremental methods, and its advantages become more apparent when there is less pseudo-data and the deviations are larger.
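
To make the SD-EMD component more concrete, below is a minimal sketch of how an adaptive many-to-many layer mapping can be obtained by solving an earth mover's (optimal transport) problem over a per-layer-pair cost matrix between teacher and student GPT-2 layers. This is an illustrative reconstruction, not the authors' implementation: the function name, the use of a per-layer cost such as the MSE between hidden states, and the uniform layer weights are all assumptions.

```python
import numpy as np
from scipy.optimize import linprog

def emd_layer_mapping(cost, teacher_w, student_w):
    """Solve the earth mover's (transportation) problem between teacher-layer
    and student-layer weights, given a per-layer-pair cost matrix.

    cost:      (T, S) array, e.g. MSE between teacher layer i and student layer j
    teacher_w: (T,) non-negative weights summing to 1
    student_w: (S,) non-negative weights summing to 1
    Returns the optimal flow matrix (T, S) and the resulting EMD value.
    """
    T, S = cost.shape
    c = cost.reshape(-1)                      # flow variables, flattened row-major

    A_eq = np.zeros((T + S, T * S))
    for i in range(T):                        # flow out of teacher layer i
        A_eq[i, i * S:(i + 1) * S] = 1.0
    for j in range(S):                        # flow into student layer j
        A_eq[T + j, j::S] = 1.0
    b_eq = np.concatenate([teacher_w, student_w])

    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    flow = res.x.reshape(T, S)
    return flow, float(np.sum(flow * cost))

# Toy usage: 12 teacher layers mapped onto 12 student layers (GPT-2 small size).
rng = np.random.default_rng(0)
cost = rng.random((12, 12))                   # in practice: per-layer distances
uniform = np.full(12, 1.0 / 12)               # assumption: uniform layer weights
flow, emd = emd_layer_mapping(cost, uniform, uniform)
print(emd)
```

Weighting each layer-pair distillation loss by the resulting flow would let every student layer draw knowledge from the teacher layers it is most aligned with, and re-estimating the layer weights during training is one way such a mapping could be made adaptive.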


Notes

  1. https://github.com/jojotenya/LAMOL


Author information


Corresponding author

Correspondence to Qingwei Zhao.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A: Examples of Real Samples and Pseudo-samples

Table 10 Examples of real samples and pseudo-samples

Appendix B: Text Classification

For our DFSD method, Fig. 8 shows the task order and the corresponding training curves for the text classification tasks.

Fig. 8 The training curves for the text classification tasks. The graph plots the performance of the model in each epoch for each task

Appendix C: Visualization of Self-Distillation Based on EMD

For the three decaNLP tasks, Fig. 9 shows the IKDs of the hidden states and attention layers during incremental language learning, for each task and for all learning orders, with a sampling ratio of γ = 0.05.

Fig. 9 The IKDs of the three decaNLP tasks. We show the IKDs of the hidden state and attention layer during incremental language learning for each task for all learning orders when the sampling ratio γ = 0.05
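
As a point of reference for the sampling ratio: γ denotes the amount of generated pseudo-data relative to the size of the new task's training set (the convention used by LAMOL, whose repository is linked in the Notes). The snippet below only illustrates what γ = 0.05 means in sample counts; the function name and the rounding rule are assumptions.

```python
def num_pseudo_samples(num_new_task_samples: int, gamma: float) -> int:
    """Pseudo-samples to generate for replay, given the sampling ratio gamma
    (pseudo-data size relative to the new task's training set)."""
    return int(gamma * num_new_task_samples)

# With gamma = 0.05, a new task with 100,000 training examples is accompanied
# by only about 5,000 replayed pseudo-samples.
print(num_pseudo_samples(100_000, 0.05))  # -> 5000
```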

About this article

Cite this article

Wang, H., Fu, R., Li, C. et al. Reminding the incremental language model via data-free self-distillation. Appl Intell 53, 9298–9320 (2023). https://doi.org/10.1007/s10489-022-03678-y
