Abstract
Incremental language learning, which retrieves pseudo-data from previous tasks, can alleviate catastrophic forgetting. However, previous methods require a large amount of pseudo-data to approach the performance of multitask learning, and performance drops dramatically when the pseudo-data are far scarcer than the new-task data. This drop occurs because the pseudo-data are learned inefficiently and deviate from the real data. To address these issues, we propose reminding the incremental language model via data-free self-distillation (DFSD), which comprises 1) self-distillation based on the Earth Mover's Distance (SD-EMD) and 2) hidden data augmentation (HDA). SD-EMD increases model efficiency by adaptively estimating the knowledge distribution across all GPT-2 layers and effectively transferring knowledge from the teacher model to the student model via adaptive self-multilayer-to-multilayer mapping. HDA reduces deviations by decomposing the generation process via data augmentation and bootstrapping. Our experiments on decaNLP and text classification tasks with low pseudo-data sampling ratios reveal that DFSD outperforms previous state-of-the-art incremental methods, and its advantages become more apparent as pseudo-data shrink and deviations grow.
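The multilayer-to-multilayer mapping in SD-EMD casts layer matching as an optimal-transport problem: each (teacher layer, student layer) pair gets a distillation cost, and the EMD flow that minimizes total cost weights the pairwise losses. The sketch below is an illustrative reimplementation of that idea in the style of EMD-based layer mapping, not the authors' code; the function names `layer_cost` and `emd_layer_loss` and the uniform layer weights are assumptions (the paper estimates the layer weights adaptively).

```python
import numpy as np
from scipy.optimize import linprog


def layer_cost(teacher_states, student_states):
    """Pairwise distillation cost: MSE between every teacher-layer and
    student-layer representation (arrays of shape [seq_len, hidden])."""
    return np.array([[np.mean((t - s) ** 2) for s in student_states]
                     for t in teacher_states])


def emd_layer_loss(cost, u, v):
    """Solve the EMD (optimal transport) problem: move teacher-layer mass u
    to student-layer mass v at minimum total cost. Returns the EMD loss and
    the flow matrix, whose entries act as layer-to-layer mapping weights."""
    n, m = cost.shape
    c = cost.ravel()  # variables f_ij in row-major order
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):                 # sum_j f_ij = u_i (teacher marginals)
        A_eq[i, i * m:(i + 1) * m] = 1.0
    for j in range(m):                 # sum_i f_ij = v_j (student marginals)
        A_eq[n + j, j::m] = 1.0
    b_eq = np.concatenate([u, v])
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    flow = res.x.reshape(n, m)
    return float(res.fun), flow


# Toy example: 2 "layers" per model with uniform layer weights.
rng = np.random.default_rng(0)
teacher = [rng.normal(size=(4, 8)) for _ in range(2)]
student = [teacher[0] + 0.01, teacher[1] - 0.01]  # nearly aligned layers
cost = layer_cost(teacher, student)
loss, flow = emd_layer_loss(cost, np.full(2, 0.5), np.full(2, 0.5))
```

In self-distillation the "teacher" and "student" here would be the same GPT-2 before and after learning a new task, so `n == m`; the flow matrix then realizes the self-multilayer-to-multilayer mapping, concentrating mass on the layer pairs that are cheapest to align.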
Appendices
Appendix A: Examples of Real Samples and Pseudo-samples
Appendix B: Text Classification
For our DFSD method, Fig. 8 shows the task order and the corresponding training curves for the text classification tasks.
Appendix C: Visualization of Self-Distillation Based on EMD
For the three decaNLP tasks, Fig. 9 visualizes the IKDs of the hidden states and attention layers during incremental language learning, for each task and all learning orders, at sampling ratio γ = 0.05.
Cite this article
Wang, H., Fu, R., Li, C. et al. Reminding the incremental language model via data-free self-distillation. Appl Intell 53, 9298–9320 (2023). https://doi.org/10.1007/s10489-022-03678-y