Abstract
Incremental language learning, which retrieves pseudo-data from previous tasks, can alleviate catastrophic forgetting. However, previous methods require a large amount of pseudo-data to approach the performance of multitask learning, and performance drops dramatically when the pseudo-data are far scarcer than the new-task data. This drop occurs because the pseudo-data are learned inefficiently and deviate from the real data. To address these issues, we propose reminding the incremental language model via data-free self-distillation (DFSD), which comprises 1) self-distillation based on the Earth Mover's Distance (SD-EMD) and 2) hidden data augmentation (HDA). SD-EMD increases model efficiency by adaptively estimating the knowledge distribution across all GPT-2 layers and effectively transferring knowledge from the teacher model to the student model via adaptive self-multilayer-to-multilayer mapping. HDA reduces deviations by decomposing the generation process via data augmentation and bootstrapping. Our experiments on decaNLP and text classification tasks with low pseudo-data sampling ratios reveal that DFSD outperforms previous state-of-the-art incremental methods, and its advantages become more apparent as pseudo-data shrink and deviations grow.
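The multilayer-to-multilayer mapping in SD-EMD casts layer matching as an optimal-transport problem: each (teacher layer, student layer) pair gets a distillation cost, and the EMD flow that minimizes total cost weights the pairwise losses. The sketch below is an illustrative reimplementation of that idea in the style of EMD-based layer mapping, not the authors' code; the function names `layer_cost` and `emd_layer_loss` and the uniform layer weights are assumptions (the paper estimates the layer weights adaptively).

```python
import numpy as np
from scipy.optimize import linprog


def layer_cost(teacher_states, student_states):
    """Pairwise distillation cost: MSE between every teacher-layer and
    student-layer representation (arrays of shape [seq_len, hidden])."""
    return np.array([[np.mean((t - s) ** 2) for s in student_states]
                     for t in teacher_states])


def emd_layer_loss(cost, u, v):
    """Solve the EMD (optimal transport) problem: move teacher-layer mass u
    to student-layer mass v at minimum total cost. Returns the EMD loss and
    the flow matrix, whose entries act as layer-to-layer mapping weights."""
    n, m = cost.shape
    c = cost.ravel()  # variables f_ij in row-major order
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):                 # sum_j f_ij = u_i (teacher marginals)
        A_eq[i, i * m:(i + 1) * m] = 1.0
    for j in range(m):                 # sum_i f_ij = v_j (student marginals)
        A_eq[n + j, j::m] = 1.0
    b_eq = np.concatenate([u, v])
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    flow = res.x.reshape(n, m)
    return float(res.fun), flow


# Toy example: 2 "layers" per model with uniform layer weights.
rng = np.random.default_rng(0)
teacher = [rng.normal(size=(4, 8)) for _ in range(2)]
student = [teacher[0] + 0.01, teacher[1] - 0.01]  # nearly aligned layers
cost = layer_cost(teacher, student)
loss, flow = emd_layer_loss(cost, np.full(2, 0.5), np.full(2, 0.5))
```

In self-distillation the "teacher" and "student" here would be the same GPT-2 before and after learning a new task, so `n == m`; the flow matrix then realizes the self-multilayer-to-multilayer mapping, concentrating mass on the layer pairs that are cheapest to align.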
Appendices
Appendix A: Examples of Real Samples and Pseudo-samples
Appendix B: Text Classification
For our DFSD method, Fig. 8 shows the task order and the corresponding training curves for the text classification tasks.
Appendix C: Visualization of Self-Distillation Based on EMD
For the three decaNLP tasks, Fig. 9 visualizes the IKDs of the hidden states and attention layers during incremental language learning, for each task and all learning orders, at sampling ratio γ = 0.05.
Cite this article
Wang, H., Fu, R., Li, C. et al. Reminding the incremental language model via data-free self-distillation. Appl Intell 53, 9298–9320 (2023). https://doi.org/10.1007/s10489-022-03678-y