Memory Aware Synapses: Learning What (not) to Forget

  • Rahaf Aljundi
  • Francesca Babiloni
  • Mohamed Elhoseiny
  • Marcus Rohrbach
  • Tinne Tuytelaars
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11207)


Humans can learn in a continuous manner. Old rarely utilized knowledge can be overwritten by new incoming information while important, frequently used knowledge is prevented from being erased. In artificial learning systems, lifelong learning so far has focused mainly on accumulating knowledge over tasks and overcoming catastrophic forgetting. In this paper, we argue that, given the limited model capacity and the unlimited new information to be learned, knowledge has to be preserved or erased selectively. Inspired by neuroplasticity, we propose a novel approach for lifelong learning, coined Memory Aware Synapses (MAS). It computes the importance of the parameters of a neural network in an unsupervised and online manner. Given a new sample which is fed to the network, MAS accumulates an importance measure for each parameter of the network, based on how sensitive the predicted output function is to a change in this parameter. When learning a new task, changes to important parameters can then be penalized, effectively preventing important knowledge related to previous tasks from being overwritten. Further, we show an interesting connection between a local version of our method and Hebb’s rule, which is a model for the learning process in the brain. We test our method on a sequence of object recognition tasks and on the challenging problem of learning an embedding for predicting <subject, predicate, object> triplets. We show state-of-the-art performance and, for the first time, the ability to adapt the importance of the parameters based on unlabeled data towards what the network needs (not) to forget, which may vary depending on test conditions.
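As a rough illustration of the importance accumulation described above (not the authors' implementation), MAS can be sketched in plain Python for a toy linear model F(x) = Wx: the importance of each weight is the average absolute gradient, over unlabeled samples, of the squared L2 norm of the predicted output with respect to that weight, and changes to important weights are then penalized when learning a new task. The function names and the toy model below are our assumptions for illustration only.

```python
def mas_importance(W, samples):
    """Omega[i][j]: mean over samples of |d ||W x||^2 / d W[i][j]|.

    For the toy model F(x) = W x, the gradient of the squared L2 norm
    of the output is d ||F(x)||^2 / d W[i][j] = 2 * (W x)[i] * x[j],
    so no autodiff library is needed for this sketch.
    """
    rows, cols = len(W), len(W[0])
    omega = [[0.0] * cols for _ in range(rows)]
    for x in samples:
        # Forward pass: y = W x
        y = [sum(W[i][j] * x[j] for j in range(cols)) for i in range(rows)]
        # Accumulate the absolute sensitivity of ||y||^2 to each weight.
        for i in range(rows):
            for j in range(cols):
                omega[i][j] += abs(2.0 * y[i] * x[j])
    n = len(samples)
    return [[v / n for v in row] for row in omega]


def mas_penalty(W, W_old, omega, lam):
    """Regularizer added to the new task's loss:
    lam * sum_ij Omega[i][j] * (W[i][j] - W_old[i][j])^2,
    which discourages changing weights deemed important for earlier tasks.
    """
    return lam * sum(
        omega[i][j] * (W[i][j] - W_old[i][j]) ** 2
        for i in range(len(W))
        for j in range(len(W[0]))
    )
```

Because the sensitivity is measured on the model's own outputs rather than on a loss, the importance weights can be estimated from unlabeled data, which is what allows adapting them to the test conditions mentioned above.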





The first author’s PhD is funded by an FWO scholarship.

Supplementary material

Supplementary material 1 (pdf 648 KB)

Supplementary material 2 (mp4 64912 KB)



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Rahaf Aljundi (1)
  • Francesca Babiloni (1)
  • Mohamed Elhoseiny (2)
  • Marcus Rohrbach (2)
  • Tinne Tuytelaars (1)
  1. KU Leuven, ESAT-PSI, imec, Leuven, Belgium
  2. Facebook AI Research, Menlo Park, USA
