Leonid Butyrev

Master's Thesis

Overcoming Catastrophic Forgetting via Hessian-free Curvature Estimates

Dr. Christopher Mutschler, Dr. Georgios Kontes (Fraunhofer IIS), Prof. Dr. Björn Eskofier

08/2019 – 02/2020

Neural networks are well known for their strong performance on tasks such as computer vision [1] and natural language processing [2]. Nonetheless, some major shortcomings make their application in certain domains difficult, if not impossible. For example, in Machine/Deep Learning applications, if the input distribution shifts over time, a learned model might overfit to the most recently seen data and forget the rest, a phenomenon referred to as catastrophic forgetting [3]. A naive way to address this issue is to require that the training data includes most (if not all) important cases and that the training algorithm periodically re-trains on all available data. In real-world applications this condition is practically impossible to meet. For example, consider the problem of autonomous driving: new situations (e.g. near accidents) occur at extremely high rates as (semi-)autonomous fleets are deployed on public roads. In this case, maintaining (and re-training on) all relevant data becomes impractical.

Motivated by these core issues, traditional approaches attempt to leverage the advantages of inductive bias [4], which has been shown to lead to higher sample efficiency and better generalization to novel tasks from the same task family [5]. This is often done by pretraining on a source task in some source domain and then finetuning on the target task in the target domain, which is commonly known as transfer learning [6]. One of the main downsides of this approach is that the transfer process is often destructive, since the parameters of the source task are essentially overwritten by those of the target task [7]. This becomes an issue if one wants to continuously re-use the knowledge of past tasks to maximize sample efficiency on arbitrary future tasks, or to retain good performance on all previous tasks while learning a new one (in a context similar to incremental learning).

This is why, in the context of continual learning or lifelong learning [8], transfer is attempted not only in the direction of the target task but also in the direction of the source tasks. More formally, as defined in [3], continual learning is characterized in practice by a series of desiderata:

  • Online learning – learning occurs at every moment, with no fixed tasks or data sets and no clear boundaries between tasks;
  • Presence of transfer (forward/backward) – the learning agent should be able to transfer and adapt what it learned from previous experience, data, or tasks to new situations, as well as make use of more recent experience to improve performance on capabilities learned earlier;
  • Resistance to catastrophic forgetting;
  • Bounded system (network) size;
  • No direct access to (all) previous experience.

Recent work in continual learning mostly addresses the catastrophic forgetting problem, which is often mitigated by adding new networks for additional tasks [9] [10] [11] [7]. While all knowledge of past tasks is retained by design, these approaches also introduce strong constraints on scalability, since the system size is no longer bounded. This is because they often either require manual extension of the model [7] or introduce rather complex architectures to control the amount of information that is shared [11].
More promising approaches attempt to mitigate catastrophic forgetting by penalizing changes in the weights that are most important for previous tasks [12] [13] [14]. One such approach is the elastic weight consolidation (EWC) algorithm [12], whose core ideas are shared by most algorithms in this group. One of the most important advantages of EWC is that its implementation is simple enough that any existing model can easily be extended to handle catastrophic forgetting without sacrificing computational efficiency. More importantly, it addresses most of the desiderata described above by design.
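As a rough illustration of penalizing changes in important weights, the following minimal sketch computes the quadratic EWC penalty described in [12] using NumPy. The function names `ewc_penalty` and `ewc_penalty_grad`, the strength parameter `lam`, and the toy values are assumptions made for this example; they are not taken from the thesis.

```python
import numpy as np


def ewc_penalty(theta, theta_star, fisher_diag, lam=1.0):
    """Quadratic EWC penalty: (lam / 2) * sum_i F_i * (theta_i - theta_star_i)^2.

    theta       -- current parameters while training on the new task
    theta_star  -- parameters found after training on the previous task
    fisher_diag -- diagonal Fisher information estimate for the previous task
    lam         -- hypothetical hyperparameter weighting old vs. new task
    """
    theta = np.asarray(theta, dtype=float)
    theta_star = np.asarray(theta_star, dtype=float)
    fisher_diag = np.asarray(fisher_diag, dtype=float)
    return 0.5 * lam * float(np.sum(fisher_diag * (theta - theta_star) ** 2))


def ewc_penalty_grad(theta, theta_star, fisher_diag, lam=1.0):
    """Gradient of the penalty with respect to theta."""
    theta = np.asarray(theta, dtype=float)
    theta_star = np.asarray(theta_star, dtype=float)
    fisher_diag = np.asarray(fisher_diag, dtype=float)
    return lam * fisher_diag * (theta - theta_star)


# Toy example: the first weight has high Fisher information, so moving it
# away from its old value is penalized more strongly than the second weight.
penalty = ewc_penalty([1.0, 2.0], [0.0, 0.0], [2.0, 0.5])
grad = ewc_penalty_grad([1.0, 2.0], [0.0, 0.0], [2.0, 0.5])
```

In training, this penalty would be added to the loss of the new task, so that weights with large Fisher values stay close to their previous values while unimportant weights remain free to change.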

Nonetheless, it has two notable downsides. First, it requires storing the diagonal of the Fisher information matrix (FIM) of the parameters for all previous tasks, which again limits scalability. Second, because it uses only the diagonal of the FIM, the underlying model assumes independence of the parameters, which is rarely the case and therefore introduces an estimation error. The goal of the master's thesis is twofold: i) to address both issues described above by extending the learning model of EWC to an online learning framework (as in [15] [16]) and by improving the FIM estimate through Hessian-free curvature estimation techniques, and ii) to apply the resulting algorithm to a Deep Reinforcement Learning problem and evaluate it there, including a comparison with other key approaches such as Online Laplace Approximation [15]. One way to evaluate the proposed algorithm would be the NeurIPS 2019 "Learn to Move – Walk Around" Challenge [17]. There, the task is to develop a controller for a physiologically plausible 3D human model that walks or runs following velocity commands with minimum effort. The advantages of mitigating catastrophic forgetting can be exploited by treating this problem as an incremental class learning problem, in which the humanoid incrementally learns to walk and run by learning simpler movements first (e.g. standing, taking a few steps, walking forward, turning), while attempting to preserve high performance on all previously learned sub-tasks (i.e. not forgetting how to stand when learning to run).
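To illustrate what "Hessian-free" means for the curvature estimation mentioned above, the sketch below computes a Hessian-vector product without ever forming the full matrix, using a central finite difference of gradients. This is only a generic illustration under stated assumptions: the thesis may use a different estimator, and `grad_fn`, `hessian_vector_product`, `eps`, and the toy quadratic loss are hypothetical names and values chosen for this example.

```python
import numpy as np


def hessian_vector_product(grad_fn, theta, v, eps=1e-4):
    """Approximate H(theta) @ v using only gradient evaluations.

    Uses the central difference
        H v ~= (g(theta + eps * v) - g(theta - eps * v)) / (2 * eps),
    so the full (n x n) curvature matrix is never built or stored.

    grad_fn -- function returning the loss gradient at given parameters
    theta   -- point at which the curvature is evaluated
    v       -- direction vector
    eps     -- finite-difference step size (an assumed default)
    """
    theta = np.asarray(theta, dtype=float)
    v = np.asarray(v, dtype=float)
    g_plus = grad_fn(theta + eps * v)
    g_minus = grad_fn(theta - eps * v)
    return (g_plus - g_minus) / (2.0 * eps)


# Toy check on a quadratic loss f(theta) = 0.5 * theta^T A theta, whose
# gradient is A @ theta and whose Hessian is exactly A.
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
theta0 = np.array([0.5, -0.3])
v = np.array([1.0, -1.0])

hv = hessian_vector_product(lambda t: A @ t, theta0, v)
# For this quadratic, hv agrees with A @ v up to floating-point error.
```

The off-diagonal entries of `A` contribute to the product, which is the kind of parameter interaction that a purely diagonal approximation ignores. In practice an automatic-differentiation framework would typically supply such products exactly rather than by finite differences.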


  1. A. Krizhevsky, I. Sutskever and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012.
  2. A. Graves, A.-r. Mohamed and G. Hinton, "Speech recognition with deep recurrent neural networks," in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013.
  3. "Continual Learning workshop, NeurIPS 2018," [Online]. Available: https://sites.google.com/view/continual2018/home
  4. T. M. Mitchell, The need for biases in learning generalizations, Department of Computer Science, Laboratory for Computer Science Research, Rutgers Univ. New Jersey, 1980.
  5. J. Baxter, "A model of inductive bias learning," Journal of Artificial Intelligence Research, vol. 12, pp. 149-198, 2000.
  6. S. J. Pan, Q. Yang and others, "A survey on transfer learning," IEEE Transactions on Knowledge and Data Engineering, vol. 22, pp. 1345-1359, 2010.
  7. A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu and R. Hadsell, "Progressive neural networks," arXiv preprint arXiv:1606.04671, 2016.
  8. P. Ruvolo and E. Eaton, "ELLA: An efficient lifelong learning algorithm," in International Conference on Machine Learning, 2013.
  9. Y. Lu, A. Kumar, S. Zhai, Y. Cheng, T. Javidi and R. S. Feris, "Fully-adaptive Feature Sharing in Multi-Task Networks with Applications in Person Attribute Classification," in CVPR, 2017.
  10. I. Misra, A. Shrivastava, A. Gupta and M. Hebert, "Cross-stitch networks for multi-task learning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
  11. S. Ruder, J. Bingel, I. Augenstein and A. Søgaard, "Sluice networks: Learning what to share between loosely related tasks," stat, vol. 1050, p. 23, 2017.
  12. J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska and others, "Overcoming catastrophic forgetting in neural networks," Proceedings of the National Academy of Sciences, vol. 114, pp. 3521-3526, 2017.
  13. F. Zenke, B. Poole and S. Ganguli, "Continual learning through synaptic intelligence," in Proceedings of the 34th International Conference on Machine Learning-Volume 70, 2017.
  14. J. Serrà, D. Surís, M. Miron and A. Karatzoglou, "Overcoming catastrophic forgetting with hard attention to the task," arXiv preprint arXiv:1801.01423, 2018.
  15. H. Ritter, A. Botev and D. Barber, "Online structured Laplace approximations for overcoming catastrophic forgetting," in Advances in Neural Information Processing Systems, 2018.
  16. F. Huszár, "Note on the quadratic penalties in elastic weight consolidation," Proceedings of the National Academy of Sciences, p. 201717042, 2018.
  17. "NeurIPS 2019 Learn to Move – Walk Around Challenge," [Online]. Available: https://www.aicrowd.com/challenges/neurips-2019-learn-to-move-walk-around