Abstract
Continual learning models are known to suffer from catastrophic forgetting. Existing regularization methods for countering forgetting operate by penalizing large changes to learned parameters. A significant downside to these methods, however, is that, by effectively freezing model parameters, they gradually suspend the model's capacity to learn new tasks. In this paper, we explore an alternative approach to the continual learning problem that aims to circumvent this downside. In particular, we ask: instead of forcing continual learning models to remember the past, can we modify the learning process from the start, such that the learned representations are less susceptible to forgetting? To this end, we explore multiple methods that could potentially encourage durable representations. We demonstrate empirically that the use of unsupervised auxiliary tasks achieves a significant reduction in parameter re-optimization across tasks, and consequently reduces forgetting, without explicitly penalizing it. Moreover, we propose a distance metric to track internal model dynamics across tasks, and use it to gain insight into the workings of our proposed approach, as well as other recently proposed methods.
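The notes below mention that the drift measure is based on a simplified KL divergence (section 4.1). As a rough illustration only, not the paper's exact procedure, one could track how much a model's output distributions on a fixed probe set move after training on a new task; the probe data and averaging scheme here are assumptions:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete probability distributions."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

def output_drift(probs_before, probs_after):
    """Mean KL divergence between a model's softmax outputs on a fixed
    probe set, recorded before and after training on a new task.
    Larger values indicate the internal representation moved more."""
    return float(np.mean([kl_divergence(p, q)
                          for p, q in zip(probs_before, probs_after)]))

# Toy example: softmax outputs for two probe samples, before and
# after training on a new task (made-up numbers).
before = [np.array([0.7, 0.2, 0.1]), np.array([0.1, 0.8, 0.1])]
after  = [np.array([0.6, 0.3, 0.1]), np.array([0.1, 0.7, 0.2])]
drift = output_drift(before, after)
```

A per-task variant (one such measure per previous task) is what the notes suggest as future work.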
Notes
In this work, we consider image classification tasks.
Note the change in notation here: we used \(\hat{\mathbf{y}}\) earlier to denote the probability distribution. In this case, \(\mathbf{y}\) denotes the class variable.
A trivial case of this is a single-layer network whose parameter vector is perpendicular to all input samples \(\mathbf{x}\). One can scale the parameter vector without affecting \(\theta^\top \mathbf{x}\).
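This invariance is easy to check numerically; a minimal sketch with a made-up two-sample input matrix:

```python
import numpy as np

# Two input samples in R^3, stacked as rows.
X = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])

# A parameter vector perpendicular to both samples.
theta = np.array([1.0, 1.0, -1.0])

# theta^T x is zero for every sample...
print(X @ theta)            # [0. 0.]

# ...and stays unchanged under arbitrary scaling of theta.
print(X @ (5.0 * theta))    # [0. 0.]
```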
Note that this corresponds to what is known in the literature as the “multi-head” setting.
We conjecture that our use of the simplified KL divergence measure described in section 4.1 may be obscuring some of the details of the behavior of LwF. We intend to explore this issue further in future work, using a per-task KL divergence measure.
Allowing Aux-1 to train for more epochs per task reduces its intransigence values. However, for a fair comparison, and due to limited computational resources, we limit all experiments to 400 iterations per task.
References
Aljundi R, Belilovsky E, Tuytelaars T, Charlin L, Caccia M, Lin M, Page-Caccia L (2019) Online continual learning with maximal interfered retrieval. In: Wallach H, Larochelle H, Beygelzimer A, d'Alché-Buc F, Fox E, Garnett R (eds) Advances in neural information processing systems, vol 32. Curran Associates, Inc., pp 11849–11860
Chaudhry A, Dokania PK, Ajanthan T, Torr PHS (2018) Riemannian walk for incremental learning: understanding forgetting and intransigence. In: Proceedings of European conference on computer vision (ECCV), pp 556–572
Clanuwat T, Bober-Irizar M, Kitamoto A, Lamb A, Yamamoto K, Ha D (2018) Deep learning for classical Japanese literature. arXiv:1812.01718
Devlin J, Chang M-W, Lee K, Toutanova K (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805
El Khatib A, Karray F (2019) Preempting catastrophic forgetting in continual learning models by anticipatory regularization. In: 2019 International joint conference on neural networks (IJCNN), pp 1–7
French RM, Chater N (2002) Using noise to compute error surfaces in connectionist networks: a novel means of reducing catastrophic forgetting. Neural Comput 14:1–15
Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y (2014) Generative adversarial nets. In: Ghahramani Z, Welling M, Cortes C, Lawrence ND, Weinberger KQ (eds) Advances in neural information processing systems, vol 27. Curran Associates, Inc., pp 2672–2680
Jo J, Bengio Y (2017) Measuring the tendency of CNNs to learn surface statistical regularities. arXiv:1711.11561
Kirkpatrick J, Pascanu R, Rabinowitz N, Veness J, Desjardins G, Rusu AA, Milan K, Quan J, Ramalho T, Grabska-Barwinska A, Hassabis D, Clopath C, Kumaran D, Hadsell R (2017) Overcoming catastrophic forgetting in neural networks. Proc Natl Acad Sci 114(13):3521–3526
Krizhevsky A (2009) Learning multiple layers of features from tiny images. Technical report, University of Toronto
LeCun Y, Bottou L, Bengio Y, Haffner P (1998) Gradient-based learning applied to document recognition. Proc IEEE 86(11):2278–2324
Li Z, Hoiem D (2018) Learning without forgetting. IEEE Trans Pattern Anal Mach Intell 40(12):2935–2947
Netzer Y, Wang T, Coates A, Bissacco A, Wu B, Ng AY (2011) Reading digits in natural images with unsupervised feature learning. In: NIPS workshop on deep learning and unsupervised feature learning
Ratcliff R (1990) Connectionist models of recognition memory: constraints imposed by learning and forgetting functions. Psychol Rev 97(2):285–308
Robins A (1993) Catastrophic forgetting in neural networks: the role of rehearsal mechanisms. In: Proceedings of the first New Zealand international two-stream conference on artificial neural networks and expert systems, pp 65–68. https://doi.org/10.1109/ANNES.1993.323080
Rusu AA, Rabinowitz NC, Desjardins G, Soyer H, Kirkpatrick J, Kavukcuoglu K, Pascanu R, Hadsell R (2016) Progressive neural networks. arXiv:1606.04671
Shin H, Lee JK, Kim J, Kim J (2017) Continual learning with deep generative replay. In: Guyon I, Luxburg UV, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R (eds) Advances in neural information processing systems, vol 30. Curran Associates, Inc., pp 2990–2999
Terekhov AV, Montone G, O’Regan JK (2015) Knowledge transfer in deep block-modular neural networks. In: Wilson SP, Verschure PFMJ, Mura A, Prescott TJ (eds) Biomimetic and biohybrid systems. Springer International Publishing, pp 268–279
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I (2017) Attention is all you need. In: Advances in neural information processing systems, pp 5998–6008
Zenke F, Poole B, Ganguli S (2017) Continual learning through synaptic intelligence. arXiv:1703.04200
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
This work was supported in part by Electronics and Telecommunications Research Institute (ETRI) grant funded by the Korean government [20ZS1100, Core Technology Research for Self-Improving Integrated Artificial Intelligence System].
Cite this article
El Khatib, A., Karray, F. Toward durable representations for continual learning. Adv. in Comp. Int. 2, 7 (2022). https://doi.org/10.1007/s43674-021-00022-8