Toward durable representations for continual learning

  • Original Article
  • Published in Advances in Computational Intelligence

Abstract

Continual learning models are known to suffer from catastrophic forgetting. Existing regularization methods counter forgetting by penalizing large changes to learned parameters. A significant downside of these methods, however, is that, by effectively freezing model parameters, they gradually suspend the model's capacity to learn new tasks. In this paper, we explore an alternative approach to the continual learning problem that aims to circumvent this downside. In particular, we ask: instead of forcing continual learning models to remember the past, can we modify the learning process from the start, such that the learned representations are less susceptible to forgetting? To this end, we explore multiple methods that could potentially encourage durable representations. We demonstrate empirically that the use of unsupervised auxiliary tasks achieves a significant reduction in parameter re-optimization across tasks, and consequently reduces forgetting, without explicitly penalizing forgetting. Moreover, we propose a distance metric to track internal model dynamics across tasks, and use it to gain insight into the workings of our proposed approach, as well as other recently proposed methods.
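
As a concrete illustration of the auxiliary-task idea above, the following is a minimal sketch (in PyTorch) of a shared encoder trained jointly on a supervised classification head and an unsupervised reconstruction head. The architecture, the choice of reconstruction as the auxiliary objective, and the loss weight aux_weight are illustrative assumptions, not the specific auxiliary tasks or hyperparameters used in the paper.

    # Minimal sketch: joint supervised + unsupervised auxiliary training on a
    # shared encoder. The hope, as described in the abstract, is that the
    # auxiliary objective shapes representations that need less re-optimization
    # when later tasks arrive. All names and weights here are illustrative.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SharedEncoderModel(nn.Module):
        def __init__(self, in_dim=784, hidden_dim=256, num_classes=10):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Linear(in_dim, hidden_dim), nn.ReLU(),
                nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            )
            self.classifier = nn.Linear(hidden_dim, num_classes)  # supervised head
            self.decoder = nn.Linear(hidden_dim, in_dim)          # unsupervised head

        def forward(self, x):
            z = self.encoder(x)
            return self.classifier(z), self.decoder(z)

    def training_step(model, optimizer, x, y, aux_weight=0.5):
        """One step of joint classification + reconstruction training."""
        logits, x_hat = model(x)
        loss = F.cross_entropy(logits, y) + aux_weight * F.mse_loss(x_hat, x)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()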

Notes

  1. In this work, we consider image classification tasks.

  2. Note the change in notation here: we used \(\hat{\mathbf{y}}\) earlier to denote the probability distribution. In this case, \(\mathbf{y}\) denotes the class variable.

  3. A trivial case of this is a single-layer network whose parameter vector is perpendicular to all input samples \(\mathbf{x}\). One can scale the parameter vector without affecting \(\theta^\top \mathbf{x}\); see the first sketch after these notes.

  4. Note that this corresponds to what is known in the literature as the “multi-head” setting.

  5. We conjecture that our use of the simplified KL divergence measure described in Sect. 4.1 may be obscuring some details of the behavior of LwF. We intend to explore this issue further in future work, using a per-task KL divergence measure; see the second sketch after these notes.

  6. Allowing Aux-1 to train for more epochs per task reduces its intransigence values. However, for a fair comparison, and due to limited computational resources, we limit all experiments to 400 iterations per task.
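
The following numeric sketch illustrates the invariance described in note 3 under an assumed toy setup (a three-dimensional input space with inputs confined to a plane); it is not code from the paper.

    # If the parameter vector theta is orthogonal to every input x, rescaling
    # theta leaves theta^T x unchanged (it stays zero), so parameters can drift
    # in norm without changing the network's outputs.
    import numpy as np

    rng = np.random.default_rng(0)
    X = np.zeros((5, 3))
    X[:, :2] = rng.normal(size=(5, 2))   # inputs lie in the x1-x2 plane
    theta = np.array([0.0, 0.0, 1.0])    # orthogonal to every row of X

    for scale in (1.0, 10.0, -3.0):
        print(scale, X @ (scale * theta))  # outputs are identically zero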

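The sketch below shows one way a per-task KL divergence measure, as mentioned in note 5, could be computed: for each earlier task, the model's output distributions before and after training on the new task are compared on that task's held-out data. The function name and the single-head output assumption are illustrative; the simplified measure defined in Sect. 4.1 of the paper may differ.

    # Per-task KL divergence between model snapshots (illustrative assumption,
    # not the paper's Sect. 4.1 definition): one score per earlier task instead
    # of a single pooled score.
    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def per_task_kl(old_model, new_model, task_loaders):
        """Return {task_id: mean KL(old || new)} over each task's held-out data."""
        old_model.eval()
        new_model.eval()
        scores = {}
        for task_id, loader in task_loaders.items():
            total, count = 0.0, 0
            for x, _ in loader:
                log_p_old = F.log_softmax(old_model(x), dim=1)  # snapshot before new task
                log_p_new = F.log_softmax(new_model(x), dim=1)  # after training on new task
                kl = F.kl_div(log_p_new, log_p_old, log_target=True,
                              reduction="batchmean")            # KL(old || new)
                total += kl.item() * x.size(0)
                count += x.size(0)
            scores[task_id] = total / max(count, 1)
        return scores
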
References

  • Aljundi R, Belilovsky E, Tuytelaars T, Charlin L, Caccia M, Lin M, Page-Caccia L (2019) Online continual learning with maximal interfered retrieval. In: Wallach H, Larochelle H, Beygelzimer A, d'Alché-Buc F, Fox E, Garnett R (eds) Advances in neural information processing systems, vol 32. Curran Associates, Inc., pp 11849–11860

  • Chaudhry A, Dokania PK, Ajanthan T, Torr PHS (2018) Riemannian walk for incremental learning: understanding forgetting and intransigence. In: Proceedings of European conference on computer vision (ECCV), pp 556–572

  • Clanuwat T, Bober-Irizar M, Kitamoto A, Lamb A, Yamamoto K, Ha D (2018) Deep learning for classical Japanese literature. arXiv:1812.01718

  • Devlin J, Chang M-W, Lee K, Toutanova K (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805

  • El Khatib A, Karray F (2019) Preempting catastrophic forgetting in continual learning models by anticipatory regularization. In: 2019 International joint conference on neural networks (IJCNN), pp 1–7

  • French RM, Chater N (2002) Using noise to compute error surfaces in connectionist networks: a novel means of reducing catastrophic forgetting. Neural Comput 14:1–15

  • Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y (2014) Generative adversarial nets. In: Ghahramani Z, Welling M, Cortes C, Lawrence ND, Weinberger KQ (eds) Advances in neural information processing systems, vol 27. Curran Associates, Inc., pp 2672–2680

  • Jo J, Bengio Y (2017) Measuring the tendency of CNNs to learn surface statistical regularities. CoRR. arXiv:1711.11561

  • Kirkpatrick J, Pascanu R, Rabinowitz N, Veness J, Desjardins G, Rusu AA, Milan K, Quan J, Ramalho T, Grabska-Barwinska A, Hassabis D, Clopath C, Kumaran D, Hadsell R (2017) Overcoming catastrophic forgetting in neural networks. Proc Natl Acad Sci 114(13):3521–3526

  • Krizhevsky A (2009) Learning multiple layers of features from tiny images. Technical report, University of Toronto

  • LeCun Y, Bottou L, Bengio Y, Haffner P (1998) Gradient-based learning applied to document recognition. Proc IEEE 86(11):2278–2324

  • Li Z, Hoiem D (2018) Learning without forgetting. IEEE Trans Pattern Anal Mach Intell 40(12):2935–2947

  • Netzer Y, Wang T, Coates A, Bissacco A, Wu B, Ng AY (2011) Reading digits in natural images with unsupervised feature learning. In: NIPS workshop on deep learning and unsupervised feature learning

  • Ratcliff R (1990) Connectionist models of recognition memory: constraints imposed by learning and forgetting functions. Psychol Rev 97(2):285–308

  • Robins A (1993) Catastrophic forgetting in neural networks: the role of rehearsal mechanisms. In: Proceedings of the first New Zealand international two-stream conference on artificial neural networks and expert systems, pp 65–68. https://doi.org/10.1109/ANNES.1993.323080

  • Rusu AA, Rabinowitz NC, Desjardins G, Soyer H, Kirkpatrick J, Kavukcuoglu K, Pascanu R, Hadsell R (2016) Progressive neural networks. CoRR. arXiv:1606.04671

  • Shin H, Lee JK, Kim J, Kim J (2017) Continual learning with deep generative replay. In: Guyon I, Luxburg UV, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R (eds) Advances in neural information processing systems, vol 30. Curran Associates, Inc., pp 2990–2999

  • Terekhov AV, Montone G, O’Regan JK (2015) Knowledge transfer in deep block-modular neural networks. In: Wilson SP, Verschure PFMJ, Mura A, Prescott TJ (eds) Biomimetic and biohybrid systems. Springer International Publishing, pp 268–279

  • Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I (2017) Attention is all you need. In: Advances in neural information processing systems, pp 5998–6008

  • Zenke F, Poole B, Ganguli S (2017) Continual learning through synaptic intelligence. arXiv:1703.04200

Author information

Corresponding author

Correspondence to Alaa El Khatib.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

This work was supported in part by an Electronics and Telecommunications Research Institute (ETRI) grant funded by the Korean government [20ZS1100, Core Technology Research for Self-Improving Integrated Artificial Intelligence System].

About this article

Cite this article

El Khatib, A., Karray, F. Toward durable representations for continual learning. Adv. in Comp. Int. 2, 7 (2022). https://doi.org/10.1007/s43674-021-00022-8
