1 Introduction

Acquiring new skills and concepts without forgetting previously acquired knowledge is a hallmark of human and animal intelligence. To accomplish this, biological learning systems leverage task-relevant knowledge from preceding learning episodes to guide the subsequent learning of new tasks. Artificial learning systems, such as neural networks, usually lack this crucial property and suffer from a problem known as "catastrophic forgetting" (McCloskey & Cohen, 1989). Catastrophic forgetting occurs when we naively apply machine learning algorithms to solve a sequence of tasks \(T_{1:t}\), where the adaptation to task \(T_t\) overwrites the parameters learned for tasks \(T_{1:t-1}\).

The Continual Learning (CL) paradigm (Thrun, 1998) has emerged as a way to investigate such problems systematically. We can divide CL approaches into four broad categories: generative approaches with memory consolidation, regularization methods, architecture and expansion methods, and algorithm-based methods. Generative methods train a generative model of the data-generating distribution so that data from old tasks can be reproduced. Data sampled from the learned model then becomes part of the training process (Shin et al., 2017; Rebuffi et al., 2017). This strategy draws inspiration from neuroscience research on the reactivation of neuronal activity patterns representing previous experiences, which is hypothesized to be vital for stabilizing new memories while retaining old ones (Wilson & McNaughton, 1994; Rasch & Born, 2007; van de Ven et al., 2016). In contrast, regularization methods (Kirkpatrick et al., 2017; Zenke et al., 2017; Ahn et al., 2019; Benavides-Prado et al., 2020; Han & Guo, 2021) introduce an additional constraint to the learning objective. The goal is to prevent changes in task-relevant parameters, where relevance may be measured, e.g., by the performance on previously seen tasks (Kirkpatrick et al., 2017; Zenke et al., 2017; Li et al., 2021; Cha et al., 2020). This approach can also be motivated by the synaptic plasticity and elasticity changes of biological neurons when learning new tasks (Ostapenko et al., 2019). CL can also be achieved by modifying the design of a model during learning (Lin et al., 2019; Fernando et al., 2017; Rusu et al., 2016; Yoon et al., 2018; Golkar et al., 2019). Such methods include adding new layers to a neural network (Zacarias & Alexandre, 2018), re-routing data through a neural network based on task information (Collier et al., 2020), distilling parameters (Zhai et al., 2019; Liu et al., 2020), and adding new experts to a mixture-of-experts architecture (Lee et al., 2020). Lastly, algorithmic approaches aim to adapt the optimization algorithm itself to avoid catastrophic forgetting, e.g., by mapping the gradient updates into a different space (Zeng et al., 2019; Wang et al., 2021). Some methods combine several of these strategies, as the individual approaches are often complementary and a mixture plausibly yields the best-performing systems (Biesialska et al., 2020; De Lange et al., 2021; Vijayan & Sridhar, 2021).

Fig. 1

Left: Multi-head architectures share a feature extractor block and select task-specific output heads based on the knowledge provided by a task oracle. Right: Our proposed architecture for single-head CL. Each layer implements a modular system, such that task data can flow through distinct "paths" (highlighted) in the network

While there has been significant progress in the field of CL, there are still some major open questions (Parisi et al., 2019). For example, most existing algorithms share a significant drawback in that they require task-specific knowledge, such as the number of tasks and which task is currently at hand (Zenke et al., 2017; Shin et al., 2017; Nguyen et al., 2017; Kirkpatrick et al., 2017; Li & Hoiem, 2017; Rao et al., 2019; Sokar et al., 2021; Yoon et al., 2018; Han & Guo, 2021; Chaudhry et al., 2021, 2018). One prominent class of CL approaches sharing this drawback is that of multi-head approaches (El Khatib & Karray, 2019; Nguyen et al., 2017; Ahn et al., 2019), which build a set of shared layers but a separate output layer ("head") per task, deterministically activated by the current task index (see Fig. 1).

Extracting relevant task information is in general a difficult problem, in particular when tasks must be distinguished without any contextual input (Hihn & Braun, 2020; Yao et al., 2019). Providing the model with such task-relevant information therefore yields overly optimistic results (Chaudhry et al., 2018). To deal with more realistic and challenging CL scenarios, models must learn to compensate for the lack of auxiliary information. The approach we propose in this work tackles this problem with a hierarchical learning system that allows us to learn a set of sub-modules specialized in solving particular tasks. To this end, we introduce hierarchical variational continual learning (HVCL) and devise mixture-of-variational-experts (MoVE) layers as an instantiation of HVCL. MoVE layers consist of M experts governed by a gating policy, where each expert maintains a posterior distribution over its parameters alongside a corresponding prior. During each forward pass, the gating policy selects one expert per layer. This sparse selection reduces computation, as only a small subset of the parameters must be updated during the back-propagation of the loss (Shazeer et al., 2017). To mitigate catastrophic forgetting, we condition the prior distributions on previously observed tasks and add a penalty term on the Kullback-Leibler divergence between each expert's posterior and its prior. This constraint facilitates a trade-off between learning and forgetting and allows us to design information-efficient decision-makers (Hihn & Braun, 2020).

When dealing with ensemble methods, two main questions arise naturally. The first is how to select ensemble members optimally using appropriate selection and fusion strategies (Kuncheva, 2004). The second is how to ensure expert diversity (Kuncheva & Whitaker, 2003; Bian & Chen, 2021). We argue that ensemble diversity benefits continual learning and investigate two complementary diversity objectives: the entropy of the expert selection process and a similarity measure between different experts based on Wasserstein exponential kernels in the context of determinantal point processes (Kulesza et al., 2012). By maximizing the determinant of the expert similarity matrix, we can then “spread” the expert parameters optimally within the shared parameter space. To summarize, our contributions are the following: (1) we extend variational continual learning (Nguyen et al., 2017) to a hierarchical multi-prior setting, (2) we derive a computationally efficient method for task-agnostic continual learning from this general formulation, (3) we introduce and evaluate novel diversity measures to improve expert specialization and diversity, and (4) we demonstrate our approach in supervised CL, generative CL, and continual reinforcement learning.

This paper is structured as follows: after introducing our method in Sect. 2, we design, perform, and evaluate the main experiments in Sect. 3. In Sect. 4, we discuss novel aspects of the current study in the context of previous literature and conclude with a final summary in Sect. 5.

2 Hierarchical variational continual learning

In this section, we first formally introduce the concept of continual learning and the related nomenclature, then extend the variational continual learning (VCL) setting introduced by Nguyen et al. (2017) to a hierarchical multi-prior setting, and finally introduce a neural network implementation as a generalized application of this paradigm in Sect. 2.1.

In CL the goal is to minimize the loss of all seen tasks given no (or limited) access to data from previous tasks:

$$\begin{aligned} \min _\theta \sum _{t=1}^T \mathbb {E}_{(x_t,y_t)\thicksim (X_t,Y_t)}\left[ \ell (f_\theta (x_t), y_t) \right] , \end{aligned}$$
(1)

where T is the total number of tasks, \(\mathcal {D}_t = \{x_t^i, y_t^i\}_{i=1}^{N_t} = (X_t, Y_t)\) the dataset of task t, \(\ell\) some loss function, and \(f_\theta\) a single predictor parameterized by \(\theta\) (e.g., a neural network). Here, task refers to an isolated training stage with a new dataset. This dataset may belong to a new set of classes, a new domain, or a new output space. We can divide these settings based on the input and output distributions (De Lange et al., 2021): in task-incremental learning the label sets change, \({Y_t} \ne {Y_{t+1}}\), but the label distributions remain equal, \(P(Y_t) = P(Y_{t+1})\), e.g., in a series of binary classification tasks, and the task identity t is always known. In domain-incremental learning we have \({Y_t} = {Y_{t+1}}\) and \(P(Y_t) = P(Y_{t+1})\), but t is unknown. Finally, in class-incremental learning we have \({Y_t} \subset {Y_{t+1}}\), but \(P(Y_t) \ne P(Y_{t+1})\) and unknown t, e.g., MNIST with ten total classes but only two classes per task. For all settings it holds that \(X_t \ne X_{t+1}\)—see Fig. 2 for an illustration.

Fig. 2

Continual Learning setups. In Task Incremental Learning the number of classes is constant and task information is provided. In Domain Incremental Learning, the number of classes is also constant, but no task information is given. In Class Incremental Learning the total number of classes is constant, but the learning agent only observes a subset at a time, e.g., two out of ten total classes per task

The variational continual learning approach (Nguyen et al., 2017) describes a general learning paradigm wherein an agent stays close to an old strategy ("prior") it has learned on a previous task \(T_{t-1}\) while learning to solve a new task \(T_t\) ("posterior"). Given datasets of input-output pairs \(\mathcal {D}_t = \{x_t^i, y_t^i\} _{i=1}^{N_t}\) of tasks \(t \in \{1, ..., T\}\), the main learning objective of maximizing the log-likelihood \(\log p_\theta (y_t^{i}\vert x_t^{i})\) for task t is augmented with an additional constraint in the following way:

$$\begin{aligned} \mathcal {L}_{\text {VCL}}^t = \sum _{i=1}^{N_t} \mathbb {E}_{\theta }\left[ \log p_\theta (y_t^{i}\vert x_t^{i})\right] - {{\,\mathrm{\text {D}_\text {KL}}\,}}\left[ p_t(\theta )\vert \vert p_{t-1}(\theta ) \right] , \end{aligned}$$
(2)

where \(p(\theta )\) is a distribution over the model's parameters \(\theta\) and \(N_t\) is the number of samples for task t. The prior constraint encourages the agent to find an optimal trade-off between solving a new task and retaining knowledge about old tasks. When the likelihood model is implemented by a neural network, a new output layer can be associated with each incoming task, resulting in a multi-head implementation. Over the course of T datasets, Bayes’ rule then recovers the posterior

$$\begin{aligned} p(\theta \vert \mathcal {D}_{1:T})&\propto p(\theta ) \prod _{t=1}^{T}\prod _{i=1}^{N_t}p(y_t^{i}\vert \theta ,x_t^{i}) \nonumber \\&= p(\theta )\prod _{t=1}^T p(\mathcal {D}_t\vert \theta ) \\&\propto p(\theta \vert \mathcal {D}_{1:T-1})p(\mathcal {D}_T\vert \theta ), \nonumber \end{aligned}$$
(3)

which forms a recursion: the posterior after seeing T datasets is obtained by multiplying the posterior after \(T-1\) datasets with the likelihood of the T-th dataset and normalizing accordingly.

In their original work, the authors propose to use multi-headed networks, i.e., to train a new output layer for each incoming task. This strategy has two main drawbacks: (1) it introduces an organizational overhead due to the growing number of network heads, and (2) task boundaries must be known at all times, making it unsuitable for more complex continual learning settings. In the following we argue that we can alleviate these problems by combining multiple decision-makers with a learned selection policy to replace the deterministic head selection.

To extend VCL to the hierarchical case, we assume that samples are drawn from a set of M independent data generating processes, i.e., the likelihood is given by a mixture model \(p(y\vert x) = \sum _{m=1}^M p(m\vert x)p(y\vert m,x)\). We define an indicator variable \(z \in Z\), where \(z_t^{i,m}\) is 1 if the output \(y_t^i\) of sample i from task t was generated by expert m and zero otherwise. The conditional probability of an output is then given by

$$\begin{aligned} p(y_t^i\vert x_t^i,\Theta ) = \sum _{m=1}^M p(z_t^{i,m}\vert x_t^i,\vartheta )p(y_t^i\vert x_t^i,\omega _m), \end{aligned}$$
(4)

where \(\vartheta\) are the parameters of the selection policy, \(\omega _m\) the parameters of the m-th expert, and \(\Theta = \{\vartheta , \{\omega _m\}_{m=1}^M\}\) the combined model parameters. The posterior after observing T tasks is then given by

$$\begin{aligned} p(\Theta \vert \mathcal {D}_{1:T})&\propto p(\vartheta )p(\omega ) \prod _{t=1}^{T}\prod _{i=1}^{N_t} \sum _{m=1}^M p(z_t^{i,m}\vert x_t^i,\vartheta )p(y_t^i\vert x_t^i,\omega _m) \nonumber \\&= p(\Theta )\prod _{t=1}^T p(\mathcal {D}_t\vert \Theta ) \\&\propto p(\Theta \vert \mathcal {D}_{1:T-1})p(\mathcal {D}_T\vert \Theta ). \nonumber \end{aligned}$$
(5)

The Bayes posterior of an expert \(p(\omega _m\vert \mathcal {D}_{1:T})\) is recovered by marginalizing over the selection variables Z. Again, this forms a recursion, in which the posterior \(p(\Theta \vert \mathcal {D}_{1:T})\) depends on the posterior after seeing \(T-1\) tasks and the likelihood \(p(\mathcal {D}_T\vert \Theta )\). We can now formulate the hierarchical variational continual learning objective for task t as maximizing the following objective:

$$\begin{aligned} \mathcal {L}_{\text {HVCL}}^{t}&= \sum _{i=1}^{N_t} \mathbb {E}_{p(\Theta )}\left[ \log p(y_t^{i}\vert x_t^{i},\Theta )\right] \\&\quad - {{\,\mathrm{\text {D}_\text {KL}}\,}}\left[ p_t(\vartheta )\vert \vert p_{1:t-1}(\vartheta ) \right] \\&\quad - {{\,\mathrm{\text {D}_\text {KL}}\,}}\left[ p_t(\omega )\vert \vert p_{1:t-1}(\omega ) \right] , \end{aligned}$$
(6)

where \(N_t\) is the number of samples in task t, and the likelihood \(p(y\vert x, \Theta )\) is defined as in Eq. (4). The Mixture-of-Variational-Experts layers we introduce in Sect. 2.1 are based on a generalization of this optimization problem.

2.1 Sparsely gated mixture-of-variational-experts layers

Fig. 3

This figure illustrates our proposed design. Each layer implements a top-k expert selection conditioned on the output of the previous layer. Each expert m maintains a distribution over its weights \(p(\theta \vert m) = \mathcal {N}(\mu _m, \sigma _m)\) and a set of bias variables \(b_{m}\). Left to right arrows represent sampling, while right to left arrows show the resulting utility and losses. Combining L layers with M experts gives \(M^L\) possible paths through the network. The architecture shown here is used in the CIFAR-10 experiments

As we plan to tackle not only supervised learning problems, but also reinforcement learning problems, we assume in the following a generic scalar utility function \(\mathbf {U}(x,f_\theta (x))\) that depends both on the input x and the parameterized agent function \(f_\theta (x)\) that generates the agent’s output y. We assume that the agent function \(f_\theta (x)\) is composed of multiple layers as depicted in Fig. 3. Our layer design builds on the sparsely gated Mixture-of-Experts (MoE) layers (Shazeer et al., 2017), which in turn draw on the Mixture-of-Experts paradigm introduced by Jacobs et al. (1991). MoEs consist of a set of M experts indexed by m and a gating network \(p(m\vert x)\) whose output is a (sparse) M-dimensional vector. All experts have an identical architecture but separate parameters. Let \(p(m\vert x)\) be the gating output and \(p(y\vert m,x)\) the response of an expert m given input x. The layer’s output is then given by a weighted sum of the experts’ responses, i.e.,  \(p(y\vert x) = \sum _{m \in M} p(m\vert x)p(y\vert m,x)\). To save computation time, we employ a top-k gating scheme in which only the k experts with the highest gating activation are evaluated, and we use an additional penalty that encourages gating sparsity (see Sect. 2.2.1). In all our experiments we set \(k = 1\) to drive expert specialization (see Sect. 2.2.1) and to reduce computation time. We implement the learning objective for task t as a layer-wise regularization in the following way:

$$\begin{aligned} \begin{aligned} \mathcal {L}_{\text {MoVE}}^{t}&= \sum _{i=1}^{N_t} \bigg [\mathbb {E}_{\Theta }\left[ \mathbf {U}(x_i^t, f_\Theta (x_i^t))\right] \\&\quad - \sum _{l=1}^{L} \mathbb {E}_{p^l(m\vert x_i^t), q^l(\theta \vert m)} \bigg [\beta _1 {{\,\mathrm{\text {D}_\text {KL}}\,}}\left[ p_t^l(m\vert x_i^t)\vert \vert p_{1:t-1}^l(m\vert x_i^t)\right] \\&\quad + \beta _2 {{\,\mathrm{\text {D}_\text {KL}}\,}}\left[ p_t^l(\theta \vert m)\vert \vert p_{1:t-1}^l(\theta \vert m) \right] \bigg ] \bigg ], \end{aligned} \end{aligned}$$
(7)

where L is the total number of layers, \(\Theta = \{\vartheta , \{\theta _m\}_{m=1}^M\}\) the combined parameters of the gating policy and the experts, and the temperature parameters \(\beta _1\) and \(\beta _2\) govern the layer-wise trade-off between utility (e.g., classification performance) and information cost.
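To make the layer mechanics concrete, the following is a minimal NumPy sketch of a single MoVE layer with top-1 gating over mean-field Gaussian experts. All names (MoVELayer, kl_diag_gauss) are our own illustrative choices, and the sketch omits the flip-out gradient estimator, the gating KL term, and the back-propagation machinery used in the actual implementation.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl_diag_gauss(mu_q, sig_q, mu_p, sig_p):
    """Closed-form KL[q || p] between diagonal Gaussians, summed over dimensions."""
    return np.sum(np.log(sig_p / sig_q)
                  + (sig_q ** 2 + (mu_q - mu_p) ** 2) / (2.0 * sig_p ** 2) - 0.5)

class MoVELayer:
    """Minimal sketch of one mixture-of-variational-experts layer with top-1 gating."""

    def __init__(self, d_in, d_out, n_experts, rng):
        self.gate_W = rng.normal(0.0, 0.1, (d_in, n_experts))     # gating weights (point estimates)
        self.mu = rng.normal(0.0, 0.1, (n_experts, d_in, d_out))  # expert posterior means
        self.sigma = np.full((n_experts, d_in, d_out), 0.1)       # expert posterior std devs
        # Priors: after finishing task t-1, the learned posterior becomes the
        # prior p_{1:t-1} used in the KL penalty of Eq. (7).
        self.prior_mu, self.prior_sigma = self.mu.copy(), self.sigma.copy()

    def forward(self, x, rng):
        gate = softmax(x @ self.gate_W)              # p(m|x), shape (batch, M)
        m = int(np.argmax(gate.mean(axis=0)))        # top-1 expert for this forward pass
        W = self.mu[m] + self.sigma[m] * rng.normal(size=self.mu[m].shape)  # sample expert weights
        y = np.maximum(x @ W, 0.0)                   # expert response with ReLU activation
        # Expert KL term of Eq. (7); the gating KL acts on the output
        # distribution p(m|x) and is omitted here for brevity.
        kl_expert = kl_diag_gauss(self.mu[m], self.sigma[m],
                                  self.prior_mu[m], self.prior_sigma[m])
        return y, gate, kl_expert

rng = np.random.default_rng(0)
layer = MoVELayer(d_in=8, d_out=4, n_experts=3, rng=rng)
y, gate, kl = layer.forward(rng.normal(size=(16, 8)), rng)
print(y.shape, gate.shape, float(kl))
```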

Thus, we allow for two major generalizations compared to Eq. (6): in lieu of the log-likelihood we allow for generic utility functions \(\mathbf {U}(\cdot )\), and instead of applying the constraint on the gating parameters, we apply it directly on the gating output distribution \(p(m\vert x)\). This implies that the weights of the gating policy are not sampled. Otherwise the gating mechanism would involve two stochastic steps: one in sampling the weights and a second one in sampling the experts. This potentially high selection variance hinders expert specialization (see Sect. 2.2.1). Encouraging the gating policy to stay close to its prior also ensures that similar inputs are assigned to the same expert. Next, we consider how objective (7) can be extended by additional terms that encourage diversity between different experts.

2.2 Encouraging expert diversity

In the following, we argue that a diverse set of experts may mitigate catastrophic forgetting in continual learning, as experts specialize more easily in different tasks, which improves expert selection. Diversity measures enjoy increasing interest in the reinforcement learning community but remain largely understudied in continual learning (e.g., Bang et al., 2021). In the reinforcement learning literature, diversity has been considered, for example, by encouraging skills or policies that are sufficiently different (Eysenbach et al., 2018; Parker-Holder et al., 2020), or by sampling trajectories that reflect goal diversity (Dai et al., 2021), which is an idea similar to well-known bagging techniques (Breiman, 1996). Moreover, diversity may arise from a sufficiently high-variance weight initialization, but this can introduce computational instabilities during back-propagation, as we lose the variance-reducing benefits of state-of-the-art initialization schemes (Glorot et al., 2010; He et al., 2015; Narkhede et al., 2021). Also, there is no guarantee that the expert parameters won’t collapse again during training.

In the following, we present two expert diversity objectives. The first one arises directly from the main learning objective and is designed to act as a regularizer on the gating policy while the second one is a more sophisticated approach that aims for diversity in the expert parameter space. The latter formulation introduces a new class of diversity measures, as we discuss in more detail in Sect. 4.1. We designed additional experiments in Sect. 3.4 to investigate their influence on learning and the resulting policies and to further motivate the need for expert diversity.

2.2.1 Diversity through specialization

The relationship between objectives of the form described by Eq. (7) and the emergence of expert specialization has been previously investigated for simple learning problems (Genewein et al., 2015) and in the context of meta-learning (Hihn & Braun, 2020), but not in the context of continual learning. This class of models assumes a two-level hierarchical system of specialized decision-makers, where first-level decision-makers \(p(m\vert x)\) select which second-level decision-maker \(p(y\vert m,x)\) serves as the expert for a particular input x. By co-optimizing

$$\begin{aligned} \max _{p(y\vert x,m), p(m\vert x)} {{\,\mathrm{\mathbb {E}}\,}}[\mathbf {U}(x,y)] - \beta _1 I(X;M) - \beta _2 I(X;Y\vert M), \end{aligned}$$
(8)

the combined system finds an optimal partitioning of the input space X, where \(I(\cdot ;\cdot )\) denotes the mutual information and \(I(\cdot ;\cdot \vert \cdot )\) the conditional mutual information between random variables. In fact, the hierarchical VCL objective given by Eq. (6) can be regarded as a special case of the information-theoretic objective given by Eq. (8), if we interpret the prior as the learning strategy of task \(t-1\) and the posterior as the strategy of task t, and set \(\beta _1 = \beta _2 = 1\). All these hierarchical decision systems correspond to a multi-prior setting, where different priors associated with different experts can specialize on different sub-regions of the input space. In contrast, specialization in the context of continual learning can be regarded as the ability to partition the task space, where each expert decision-maker m solves a subset of old tasks \(T^m \subseteq T_{1:t}\). In both cases, expert diversity is a natural consequence of specialization if the gating policy \(p(m\vert x)\) partitions the input space between the experts. Using gradient descent on parameterized distributions, objective (8) can also be maximized in an online manner (Hihn & Braun, 2020).

In addition to the implicit pressures for specialization already implied by Eq. 7, here we investigate the effect of an additional entropy cost. Inspired by recent entropy regularization techniques (Eysenbach et al., 2018; Galashov et al., 2019; Grau-Moya et al., 2019), we aim to improve the gating policy by introducing the entropy cost

$$\begin{aligned} \max _{p(m\vert x)} H(M\vert X) - H(M), \end{aligned}$$
(9)

where M is the set of experts and X the inputs, and \(H(M\vert X) = -\sum _{m,x} p(x) p(m\vert x) \log p(m\vert x)\) denotes the conditional entropy of the expert selection. Minimizing the marginal entropy H(M) prefers solutions that concentrate the selection on a small number of active experts, while the conditional entropy term regularizes how much task information the gating policy extracts from the input, in line with the information cost \(I(X;M) = H(M) - H(M\vert X)\) in Eq. (8). In our implementation, we compute these entropies batch-wise, as the full entropies are not tractable. We evaluate these entropy penalties in Sect. 3.4.
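For illustration, the batch-wise entropy estimates can be computed directly from the gating outputs; the following sketch (with illustrative function names) approximates p(x) by the empirical distribution of the batch.

```python
import numpy as np

def entropy_terms(gate_probs, eps=1e-8):
    """Batch-wise estimates of H(M|X) and H(M) from gating outputs.

    gate_probs: array of shape (batch, M) with p(m|x) for each sample;
    p(x) is approximated by the empirical distribution of the batch.
    """
    p_m_given_x = np.clip(gate_probs, eps, 1.0)
    # H(M|X): average over the batch of the per-sample gating entropies
    h_cond = -np.mean(np.sum(p_m_given_x * np.log(p_m_given_x), axis=1))
    # H(M): entropy of the batch-averaged marginal p(m)
    p_m = np.clip(gate_probs.mean(axis=0), eps, 1.0)
    h_marg = -np.sum(p_m * np.log(p_m))
    return h_cond, h_marg

# Toy check: a near-deterministic gating over four experts
probs = np.array([[0.97, 0.01, 0.01, 0.01],
                  [0.01, 0.97, 0.01, 0.01]])
h_cond, h_marg = entropy_terms(probs)
objective = h_cond - h_marg   # Eq. (9), evaluated batch-wise
print(h_cond, h_marg, objective)
```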

2.2.2 Parameter diversity

Fig. 4

Upper Row: We seek experts that are both specialized, i.e., their posterior p is close to their prior q, and diverse, i.e., posteriors \(p_i,p_j \forall i \ne j\) are sufficiently distant from one another. Bottom Row: To this effect, we maximize the determinant of the kernel matrix K, effectively filling the feature space. In the case of two experts this means maximizing \(\mathrm {det}({\textbf {K}}) = 1 - k(p_0,p_1)^2\), which we can achieve by maximizing the Wasserstein-2 distance between the posteriors \(p_0\) and \(p_1\)

Our second diversity formulation is based on differences in the parameter space, rather than in the output space as formalized by Eqs. (9) or (8). To find expert parameters that are pairwise different, we introduce a symmetric distance measure and we show how this measure can be efficiently computed and maximized. By maximizing distance in parameter space, we hope to find a set of expert parameters stretched over the space of possible parameters. This in turn helps to prevent collapsing to a state where all experts have similar parameters (and thus similar outputs), rendering the idea behind an ensemble useless.

Our idea builds on determinantal point processes (DPPs) (Kulesza et al., 2012), a mechanism that produces diverse subsets by sampling proportionally to the determinant of the kernel matrix of points within the subset (Macchi, 1975). A point process P on a ground set \(\mathcal {Y}\) is a probability measure over finite subsets of \(\mathcal {Y}\). A sample from P may be the empty set, the entirety of \(\mathcal {Y}\), or anything in between. P is a determinantal point process if, when Y is a random subset drawn according to P, we have, for every \(A \subseteq \mathcal {Y}\), \(P(A \subseteq Y) = \mathrm {det}(K_A)\) for some real, symmetric \(N\times N\) matrix K indexed by the elements of \(\mathcal {Y}\). Here, \(K_A = [K_{ij}]_{i,j\in A}\) denotes the restriction of K to the entries indexed by elements of A, and we adopt \(\mathrm {det}(K_\emptyset ) = 1\). Since P is a probability measure, all principal minors \(\mathrm {det}(K_A)\) of K must be non-negative, and thus K itself must be positive semi-definite. These requirements turn out to be sufficient: any K with \(0\preceq K \preceq I\) defines a DPP. We refer to K as the marginal kernel since it contains all the information needed to compute the probability of any subset A of \(\mathcal {Y}\). If \(A = \{i\}\) is a singleton, then we have \(P(i\in Y) = K_{ii}\). In this case, the diagonal of K gives the marginal probabilities of inclusion for individual elements of \(\mathcal {Y}\). Diagonal entries close to 1 correspond to elements of \(\mathcal {Y}\) selected with high probability. The matrix K is defined by a kernel function \(k(x_0,x_1)\). A kernel is a two-argument real-valued function over \(\mathcal {X}\times \mathcal {X}\) such that for any \(x_0,x_1 \in \mathcal {X}\):

$$\begin{aligned} k(x_0,x_1) = \langle \phi (x_0),\phi (x_1)\rangle _\mathcal {F}, \end{aligned}$$
(10)

where \(\mathcal {X}\) is a vector space and \(\mathcal {F}\) is an inner-product space such that \(\forall x \in \mathcal {X}: \phi (x) \in \mathcal {F}\). Specifically, we use an exponential kernel based on the Wasserstein-2 distance \(W_2(p,q)\) between two probability distributions p and q. The \(p^\text {th}\) Wasserstein distance between two probability measures p and q in \(P_p(M)\) is defined as

$$\begin{aligned} W_p (p, q):=\left( \inf _{\gamma \in \Gamma (p, q)} \int _{M \times M} d(x, y)^p \, \mathrm {d} \gamma (x, y) \right) ^{1/p}, \end{aligned}$$
(11)

where \(\Gamma (p,q)\) denotes the collection of all measures on \(M \times M\) with marginals p and q on the first and second factors. Let p and q be two isotropic Gaussian distributions and \(W_2^2(p,q)\) the squared Wasserstein-2 distance between p and q. The exponential Wasserstein-2 kernel is then defined by

$$\begin{aligned} k(p,q) = \exp \left( -\frac{W_2^2(p,q)}{2h^2}\right) , \end{aligned}$$
(12)

where h is the kernel width. We show in Appendix B that Eq. (12) gives a valid kernel. This formulation has two properties that make it suitable for our purpose. Firstly, the Wasserstein distance is symmetric, i.e., \(W_2^2(p,q) = W_2^2(q,p)\), which in turn will lead to a symmetric kernel matrix. This is not true for other similarity measures on probability distributions, such as \({{\,\mathrm{\text {D}_\text {KL}}\,}}\) (Cover & Thomas, 2012). Secondly, if p and q are Gaussian and mean-field approximations, i.e., covariance matrices \(\Sigma _p\) and \(\Sigma _q\) are given by diagonal matrices, such that \(\Sigma _p = \text {diag}(d_p)\) and \(\Sigma _q = \text {diag}(d_q)\), \(W_2^2(p,q)\) can be computed in closed form as

$$\begin{aligned} W_2^2(p,q) = \vert \vert \mu _p - \mu _q\vert \vert _2^2 + \vert \vert \sqrt{d_p} - \sqrt{d_q}\vert \vert _2^2, \end{aligned}$$
(13)

where \(\mu _p\) and \(\mu _q\) are the means and \(d_p\) and \(d_q\) the diagonal covariance entries of distributions p and q. We provide a more detailed derivation of Eq. (13) in Appendix A. For each layer l with N experts the following regularization objective is added to the main objective:

$$\begin{aligned} \max \sum _{l=1}^L \mathrm {det}(K_l) = \sum _{l=1}^L \mathrm {det} \begin{pmatrix} 1 &{} k(p_1^l, p^l_2) &{} \cdots &{} k(p^l_1, p_N^l) \\ k(p^l_2, p^l_1) &{} 1 &{} \cdots &{} k(p^l_2, p^l_N) \\ \vdots &{} \vdots &{} \ddots &{} \vdots \\ k(p^l_N, p^l_1) &{} k(p^l_N, p^l_2) &{} \cdots &{} 1 \\ \end{pmatrix} \end{aligned}$$
(14)

where L is the number of layers, \(k(p^l_i,p^l_j)\) is the kernel evaluated between the posteriors of the i-th and j-th expert of layer l (see Eqs. 10 and 12), and \(\mathrm {det}(K)\) denotes the matrix determinant of the kernel matrix K. Note that the matrix is symmetric, which reduces computation time. Computing \(\mathrm {det}(K)\) can have some pitfalls, which we discuss in Sect. 4.2.

From a geometric perspective, the determinant of the kernel matrix represents the volume of a parallelepiped spanned by feature maps corresponding to the kernel choice. We seek to maximize this volume, effectively filling the parameter space—see Fig. 4 for an illustration.
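For illustration, the diversity term can be evaluated in a few lines of NumPy. The function names and the kernel width are illustrative placeholders; during training the (log-)determinant is added to the loss and differentiated, rather than merely evaluated as done here.

```python
import numpy as np

def w2_sq_diag_gauss(mu_p, d_p, mu_q, d_q):
    """Squared Wasserstein-2 distance between mean-field Gaussians (Eq. 13)."""
    return np.sum((mu_p - mu_q) ** 2) + np.sum((np.sqrt(d_p) - np.sqrt(d_q)) ** 2)

def wasserstein_kernel_matrix(mus, ds, h=1.0):
    """Exponential Wasserstein-2 kernel matrix over one layer's experts (Eqs. 12 and 14)."""
    n = len(mus)
    K = np.eye(n)
    for i in range(n):
        for j in range(i + 1, n):
            k_ij = np.exp(-w2_sq_diag_gauss(mus[i], ds[i], mus[j], ds[j]) / (2.0 * h ** 2))
            K[i, j] = K[j, i] = k_ij   # symmetric matrix with unit diagonal
    return K

rng = np.random.default_rng(0)
mus = [rng.normal(0.0, 1.0, 32) for _ in range(3)]   # expert posterior means
ds = [np.full(32, 0.1) for _ in range(3)]            # diagonal covariance entries
K = wasserstein_kernel_matrix(mus, ds, h=5.0)        # h = 5.0 is a placeholder width
diversity = np.linalg.det(K)                         # maximized as a regularizer
print(np.round(K, 3), diversity)
```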

3 Experiments

Table 1 Continual learning results in the split MNIST (S-MNIST) and permuted MNIST (P-MNIST) benchmark compared to current CL methods

While the correct and robust evaluation of continual learning algorithms is still a topic of discussion (Farquhar & Gal, 2018; Hsu et al., 2018), we follow the majority of studies to ensure a fair comparison. We evaluate our approach in current supervised learning benchmarks in Sect. 3.1, in a generative learning setting in Sect. 3.2, and in the continual reinforcement learning setup in Sect. 3.3. Additionally, we conduct ablation studies in Sect. 3.4 to investigate the influence of the diversity objective, the generator quality, the hyper-parameters \(\beta _1\) and \(\beta _2\), and the number of experts. We give experimental details in Appendix C.

3.1 Continual supervised learning scenarios

The basic setting of continual learning is defined as an agent which sequentially observes data from a series of tasks \(\mathcal {T} = \{T_i\}_{i=1}^N\) and must learn \(T_i\) while maintaining performance on older tasks \(T_{<i}\). We evaluate the performance of our method in this setting in split MNIST (Table 1), split CIFAR-10 and split CIFAR-100 (Table 2). We follow the domain incremental setup (van de Ven et al., 2020; Kessler et al., 2021; Raghavan & Balaprakash, 2021; Hsu et al., 2018; Mazur et al., 2021; He & Zhu, 2022), where the number of classes is constant and task information is not available, but we also compare against task-incremental methods (Zenke et al., 2017; Shin et al., 2017; Nguyen et al., 2017; Kirkpatrick et al., 2017; Li & Hoiem, 2017; Rao et al., 2019; Sokar et al., 2021; Yoon et al., 2018; Han & Guo, 2021; Chaudhry et al., 2021, 2018) in Appendix C Tables 3 and 4, where the task information is available, to give a complete overview of current methods. The different Continual Learning setups are illustrated in more detail in Fig. 2.

The first benchmark builds on the MNIST dataset. Five binary classification tasks from the MNIST dataset arrive in sequence: 0/1, 2/3, 4/5, 6/7, and 8/9. In time step t, the performance is measured as the average classification accuracy on all tasks up to task t. In permuted MNIST, the task received at each time step t consists of labeled MNIST images whose pixels have undergone a fixed random permutation. The second benchmark is a variation of the CIFAR-10/100 datasets. In split CIFAR-10, we divide the ten classes into five binary classification tasks. In total, CIFAR-10 consists of 60,000 \(32\times 32\times 3\) images in 10 classes, with 6000 images per class. There are 50,000 training images and 10,000 test images. CIFAR-100 is like CIFAR-10, except that it has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class. In split CIFAR-100, each task is a 10-way classification problem, forming ten tasks in total.
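For concreteness, the following sketch shows how class labels can be grouped into such sequential split tasks; the function name and the domain-incremental label remapping are illustrative choices rather than our exact data pipeline.

```python
import numpy as np

def make_split_tasks(labels, classes_per_task=2, domain_incremental=True):
    """Group class labels into sequential tasks (e.g., split MNIST / split CIFAR-10).

    Returns, per task, the sample indices and the task-local targets. In the
    domain-incremental setup the targets are remapped to 0..classes_per_task-1
    and no task identifier is handed to the learner.
    """
    classes = np.unique(labels)
    tasks = []
    for t in range(0, len(classes), classes_per_task):
        task_classes = classes[t:t + classes_per_task]
        idx = np.flatnonzero(np.isin(labels, task_classes))
        y = labels[idx]
        if domain_incremental:
            y = np.searchsorted(task_classes, y)   # e.g., classes {4, 5} -> targets {0, 1}
        tasks.append((idx, y))
    return tasks

# Toy example: ten classes split into five binary tasks, mimicking split MNIST/CIFAR-10
labels = np.repeat(np.arange(10), 3)
for t, (idx, y) in enumerate(make_split_tasks(labels)):
    print(f"task {t}: {len(idx)} samples, targets {sorted(set(y.tolist()))}")
```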

We achieve comparable results to current state-of-the-art approaches (see Tables 1 and 2) on all three supervised learning benchmarks.

3.2 Generative continual learning

Table 2 Continual learning results in the split CIFAR-10 and the split CIFAR-100 benchmark compared to current CL methods

Generative CL is a simple but powerful paradigm (Shin et al., 2017; Rebuffi et al., 2017; van de Ven et al., 2020). The main idea is to learn the data generating distribution and simulate data of previous tasks. We can extend our approach to the generative setting by modeling a variational autoencoder using the novel layers we propose in this work. We provide hyper-parameters and other experimental settings in Appendix C.

Modeling the latent variable z to capture the dynamics of the data generating distribution p(x) is difficult if p(x) is multi-modal, and authors have suggested the use of more complex distributions (Hadjeres et al., 2017; Ghosh et al., 2019; Vahdat & Kautz, 2020) as variational prior p(z). We model the distribution of the latent variable z in the variational autoencoder by using a densely connected MoVE layer with 3 experts. Using multiple experts enables us to capture a richer class of distributions than a single Gaussian distribution could, as is usually the case in VAEs. We can interpret this as z following a Gaussian mixture model, whose components are mutually exclusive and modeled by the experts. We integrate the generated data by optimizing a mixture of the loss on the new task data and the loss on the generated data:

$$\begin{aligned} L(\theta ) = \frac{1}{2\vert B_{t}\vert } \sum _{b\in B_t}\ell (b) + \frac{1}{2\vert B_{1:t-1}\vert } \sum _{b\in B_{1:t-1}}\ell (b), \end{aligned}$$
(15)

where \(B_t\) is a batch of data from the current task, \(B_{1:t-1}\) a batch of generated data (instead of stored data from previous tasks), and \(\ell (b)\) the loss of sample b. We were able to improve our results in the supervised settings by incorporating a generative component as a replay mechanism, as we show in Tables 1 and 2.
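A minimal sketch of this replay objective, assuming a generic per-sample loss function, is given below; all names are illustrative.

```python
import numpy as np

def replay_loss(loss_fn, new_batch, generated_batch):
    """Equally weighted mixture of current-task loss and generative-replay loss (Eq. 15)."""
    new_term = np.mean([loss_fn(b) for b in new_batch])
    replay_term = np.mean([loss_fn(b) for b in generated_batch])
    return 0.5 * new_term + 0.5 * replay_term

# Toy usage with a dummy per-sample loss
loss_fn = lambda b: float(np.sum(b ** 2))
new_batch = [np.ones(4), 2 * np.ones(4)]        # samples from the current task
generated_batch = [0.5 * np.ones(4)]            # samples produced by the generator
print(replay_loss(loss_fn, new_batch, generated_batch))
```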

3.3 Continual reinforcement learning

In the continual reinforcement learning (CRL) setting, the agent is tasked with finding an optimal policy in sequentially arriving reinforcement learning problems. To benchmark our method in this setting, we follow the experimental protocol of Ahn et al. (2019) and use a series of reinforcement learning problems from the PyBullet environments (Coumans & Bai, 2016–2021; Ellenberger, 2018–2019). In particular, we use the following: Walker2D, Half Cheetah, Ant, Inverted Double Pendulum, and Hopper. The environments we selected have different state and action dimensions, which implies that we cannot use a single neural network to model policies and value functions. To remedy this, we pad each state and action with zeros to have equal dimensions. The Ant environment has the highest dimensionality with a state dimensionality of 28 and an action dimensionality of 8. All others are zero-padded to have this dimensionality. We provide hyper-parameters and other settings in Appendix C.

Our approach to continual reinforcement learning can build upon any deep reinforcement learning algorithm (see Wang et al., 2020 for a review of current algorithms). Here, we chose soft actor-critic (SAC) (Haarnoja et al., 2018). We extend SAC by implementing all neural networks with MoVE layers. When a new task arrives, the old posterior over the expert parameters and the gating posterior become the new priors. After each update step in task t, we evaluate the agent in all previous tasks \(T_{1:t}\) for three episodes each. We divide the reward achieved during evaluation by the mean reward during training and report the cumulative normalized reward, which has an upper bound of t in the t-th task.

Fig. 5

Continual reinforcement learning. The upper row shows the cumulative normalized rewards over time across the five reinforcement learning tasks. Each vertical dotted line indicates a change in environment after 1 million frames. Performance is computed according to Eq. (16) after each frame and thus represents the total reward achieved by the agent summed over t tasks (with a maximum normalized reward of one per task). Thus, a value close to t indicates no forgetting, while 1.0 shows the total forgetting of old tasks. The lower row shows the absolute cumulative episodic reward in each of the five environments given pre-training on the preceding tasks. We compare against EWC (Kirkpatrick et al., 2017) and UCL (Ahn et al., 2019). The lower bound is given by a naively trained dense neural network using SAC (Haarnoja et al., 2018) without CL and without access to old environments

We compare our approach against a simple continuously trained SAC implementation with dense neural networks, EWC (Kirkpatrick et al., 2017), and the recently published UCL (Ahn et al., 2019) method. UCL is similar to our approach in that it also employs Bayesian neural networks, but the weight regularization acts on a per-weight basis. Note that, in contrast to our approach, UCL and EWC both require task information to compute task-specific losses. Our results (see Fig. 5) show that our approach can sequentially learn new policies while maintaining an acceptable performance on previously seen tasks. We evaluate the methods by computing the following score J(t) for each time step after training on task t is complete:

$$\begin{aligned} J(t) = \sum _{k=1}^t \frac{1}{NR_{\mathrm {avg}}(k, \pi _{k})}\sum _{n=1}^N R_n(k, \pi _{t}), \end{aligned}$$
(16)

where N is the number of episodes to average over, and \(R(k, \pi _{k})\) is the cumulative episodic reward in environment k under policy \(\pi _{k}\):

$$\begin{aligned} R(k, \pi _{k}) = \mathbb {E}_{a_i\thicksim \pi _{k}(s_i),\, s_{i+1}\thicksim p_k(s_{i+1}\vert a_i, s_i)}\left[ \sum _i r(s_i, a_i)\right] , \end{aligned}$$
(17)

where \(p_k(s_{i+1}\vert a_i, s_i)\) are the dynamics of environment k and \(r(s_i,a_i)\) is the immediate reward for executing action \(a_i\) in state \(s_i\) at time step i. The policy \(\pi _{k}\) refers to the policy trained on tasks up to k. The normalization is done by dividing by the average performance over ten episodes of the agent in task k, when the agent was trained on that task for the first time. Thus we get a score that measures how well the agent retains its performance on a past environment while learning to solve new problems. As we sum over past environments, an increasing score indicates a successful trade-off between learning and forgetting, while a decreasing or stagnating score indicates forgetting of old tasks.
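The following sketch computes this score from stored evaluation rewards and per-task training baselines; variable names and the toy numbers are illustrative.

```python
import numpy as np

def continual_rl_score(eval_rewards, train_baselines):
    """Cumulative normalized reward J(t) after training on task t (Eq. 16).

    eval_rewards: eval_rewards[k] is a list of episodic rewards collected in
        environment k when evaluating the *current* policy pi_t.
    train_baselines: train_baselines[k] is the average episodic reward the agent
        achieved in environment k when it was originally trained on that task.
    """
    score = 0.0
    for k, rewards in enumerate(eval_rewards):
        score += np.mean(rewards) / train_baselines[k]
    return score

# Toy example after the third task: good retention of task 0, partial retention of task 1
eval_rewards = [[100.0, 98.0, 102.0], [40.0, 45.0, 35.0], [75.0, 80.0, 70.0]]
train_baselines = [100.0, 80.0, 80.0]
print(continual_rl_score(eval_rewards, train_baselines))  # upper bound is 3.0
```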

Our method outperforms UCL (Ahn et al., 2019) and EWC (Kirkpatrick et al., 2017). In this setting, naively training the agent sequentially (labeled "Dense") yields poor performance. This behavior indicates the complete forgetting of old policies. The bottom row of Fig. 5 shows the performance of our method in any particular environment, when pre-training on the preceding environments. It shows that the other methods (UCL and EWC) adapt more successfully to the individual tasks, which is, however, coupled with catastrophic forgetting when switching to the next task. In contrast, our method achieves a better trade-off between learning and forgetting. For this result, we did not optimize hyper-parameters for single-task performance, but simply set them such that the training performance in each task was comparable in order of magnitude to the other methods. The evaluation across tasks with the same normalized metric then shows the superior ability of HVCL to maintain performance over a sequence of reinforcement learning tasks. Additionally, we note that the variance of the results achieved by our method is lower, suggesting a more stable and reliable training phase.

3.4 Ablation studies

To further investigate the methods we propose in this work, we designed a set of ablation experiments. In particular, we aim to demonstrate the importance of each component. To this effect, we run experiments investigating the generator quality in the generative CL setting, study the diversity bonuses in the supervised CL scenario, and take a closer look at the number of experts and the influence of the \({{\,\mathrm{\text {D}_\text {KL}}\,}}\) weights in the continual reinforcement learning setup.

3.4.1 Investigating diversity bonuses

In Sect. 2.2 we introduced a diversity objective to stabilize learning in a mixture-of-experts system. Additionally, we argued in favor of an entropy bonus to encourage a selection policy that favors high certainty and sparsity. To investigate the validity of these additions, we run a set of experiments on the split CIFAR-10 dataset as described in Sect. 3, but with different bonuses—see Fig. 6. In the baseline setup, we used no objectives other than those described by Eq. (6).

Apart from the classification accuracy, we are interested in three information-theoretic quantities that allow us to investigate the system more closely. Firstly, the mutual information between the data generating distribution p(x) and the expert selection \(p(m\vert x)\), as measured by I(M;X), indicates how much uncertainty over m the gating unit can reduce on average after observing an input x. A higher value means that inputs are differentiated better, which is what we would expect from a more diverse set of experts. I(M;X) is the highest when we use a DPP-based diversity objective ("DPP"), while the entropy of the selection policy H(M) is lowest when we use an entropy-based diversity measure ("H-Penalties"), which both show that the objectives we introduced in this study yield the intended results. Combining both the DPP diversity bonus and the entropy penalty on the expert selection ("DPP+H-Pen.") enforces a trade-off between both objectives and yields the best empirical results. We average the results of ten random seeds in each setting.

Fig. 6

Here we evaluate the proposed diversity measures in the split CIFAR-10 benchmark. We averaged every experiment over three random trials. Information-theoretic quantities I(M;X), I(C;X), and H(X) were measured for each layer and averaged

3.4.2 Generator quality

Fig. 7

Left and middle: Information-theoretic measures for the generator quality. Right: The first two rows show original images and the corresponding reconstructions. The following five rows show samples generated from the VAE after each task (automobiles vs. airplanes, birds vs. cats, deers vs. frogs, dogs vs. horses, and ships vs. trucks)

We introduced a generative approach to continual learning in Sect. 3.2 by implementing a variational autoencoder using our proposed layer design. This addition improved classification performance and mitigated catastrophic forgetting, as evidenced by the results shown in Tables 1 and 2.

By borrowing methods from the generative learning community, we can investigate the performance further. The main focus lies on the quality of the generated images. We cannot straightforwardly measure the accuracy, as artificial images lack labels. Thus, we first use the trained classifier to obtain labels and compute metrics based on these self-generated labels. We opted for the Inception Score (IS) (Salimans et al., 2016), as it is widely used in the generative learning community. In its initially proposed formulation, the IS builds on the \({{\,\mathrm{\text {D}_\text {KL}}\,}}\) between the conditional and the marginal class probabilities as returned by a pre-trained Inception model (Szegedy et al., 2015). To investigate the quality of the generated images with respect to the continually trained classifier, we use a different version of the Inception Score, which we define as

$$\begin{aligned} IS_{T}(G_{1:T}) = \mathbb {E}_{x\thicksim G_{1:T}}\left[ {{\,\mathrm{\text {D}_\text {KL}}\,}}\left[ p_{1:T}(y\vert x) \vert \vert p_{1:T}(y)\right] \right] , \end{aligned}$$
(18)

where \(G_{1:T}\) is the data generator trained on tasks up to T, \(p_{1:T}(y\vert x)\) the conditional class distribution returned by the classifier trained up to task T, and \(p_{1:T}(y)\) the marginal class distribution up to task T. Note that \(IS_{T}(G_{1:T}) \le \log _2 N_c\), where \(N_c\) is the number of classes. We show \(IS_{T}\) and the entropy of p(y) in the split CIFAR-10 and CIFAR-100 settings in Fig. 7. In both cases, it can be seen that the generated pictures retain task-specific information, although there is a notable decrease across tasks.
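A sketch of this continual Inception Score, with the marginal \(p_{1:T}(y)\) approximated by the average of the conditionals over the generated samples, is given below; names and numbers are illustrative.

```python
import numpy as np

def continual_inception_score(cond_probs, eps=1e-8):
    """Continual-learning Inception Score of Eq. (18), with the KL measured in bits.

    cond_probs: array of shape (N, C) with p_{1:T}(y|x) returned by the continually
        trained classifier for N generated images; the marginal p_{1:T}(y) is
        approximated by averaging the conditionals.
    """
    p_y_given_x = np.clip(cond_probs, eps, 1.0)
    p_y = np.clip(p_y_given_x.mean(axis=0), eps, 1.0)
    kl = np.sum(p_y_given_x * (np.log2(p_y_given_x) - np.log2(p_y)), axis=1)
    return float(np.mean(kl))

# Toy example with four classes: confident, class-diverse samples score highest
probs = np.array([[0.94, 0.02, 0.02, 0.02],
                  [0.02, 0.94, 0.02, 0.02],
                  [0.02, 0.02, 0.94, 0.02],
                  [0.02, 0.02, 0.02, 0.94]])
print(continual_inception_score(probs))  # bounded above by log2(4) = 2 bits
```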

3.4.3 Number of experts

Fig. 8

Experimental results for systems with 1, 2, and 4 experts. As expected, adding experts mitigates forgetting. Each curve represents three trials in the continual reinforcement learning domain as described in Sect. 3.3

Our method builds on a mixture of experts model and it is thus natural to assume that increasing the number of experts improves performance. Indeed, this is the case as we demonstrate in additional continual reinforcement learning experiments in Fig. 8. As Fig.  1 illustrates, adding experts to layers increases the number of possible information processing paths through the network. Equipped with a diverse and specialized set of parameters, each path can be regarded as a distinct sub-network that learns to solve tasks.

3.4.4 \({{\,\mathrm{\text {D}_\text {KL}}\,}}\) weights

As with any hyper-parameter, setting a specific value for \(\beta\) has a strong influence on the outcome of the experiments. Setting it too large will lead to the regularization term dominating the loss, and the experts can’t learn a new task, as the new parameters remain close to the parameters of the previous task. A low value will drive the penalty term towards zero, which, in turn, will not preserve parameters from old tasks. In principle, there are three ways to choose \(\beta\).

First, by setting \(\beta\) such that it satisfies an expert information-processing limit. This technique has the advantage that we can interpret this value, e.g., "each expert can process 1.57 bits of information on average, i.e., distinguishing between three options", but shifts the burden from picking \(\beta\) to setting a target entropy (see, e.g., Haarnoja et al., 2018; Grau-Moya et al., 2019 for an example of this approach). Second, by employing a schedule for \(\beta\), as, e.g., proposed by Fu et al. (2019). Last, another option is to run a grid search over a pre-defined range and choose the value that fits best. In our supervised learning experiments, we used a cyclic schedule for \(\beta _1\) and \(\beta _2\) (Fu et al., 2019), while we kept them fixed in the reinforcement learning experiments. To systematically investigate the influence of these parameters, we conducted additional experiments (see Fig. 9).
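The cyclic schedule can be sketched as follows, in the spirit of Fu et al. (2019); the cycle length, ramp fraction, and maximum value shown are placeholders rather than the values used in our experiments.

```python
def cyclical_beta(step, cycle_length=1000, ramp_fraction=0.5, beta_max=1.0):
    """Cyclical schedule for a D_KL weight: within each cycle the weight ramps up
    linearly over the first part of the cycle and is then held at beta_max."""
    phase = (step % cycle_length) / cycle_length
    return beta_max * min(phase / ramp_fraction, 1.0)

# Example: weight at a few training steps of a 1000-step cycle
for step in (0, 250, 500, 900, 1000):
    print(step, round(cyclical_beta(step), 3))
```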

4 Discussion

Fig. 9

In this figure, we show the influence of the \({{\,\mathrm{\text {D}_\text {KL}}\,}}\) weights \(\beta _1\) and \(\beta _2\) in the continual reinforcement learning setting. Setting the expert \({{\,\mathrm{\text {D}_\text {KL}}\,}}\) weight \(\beta _2 < 0.1\) results in poor performance, as posteriors deviate too much from their priors. On the other hand, a lower gating \({{\,\mathrm{\text {D}_\text {KL}}\,}}\) weight \(\beta _1\) allows for a flexible expert allocation and improves performance

4.1 Related work

The principle we propose in this work falls into a wider class of methods that deal more efficiently with learning and decision-making problems by integrating information-theoretic cost functions. Such information-constrained machine learning methods have enjoyed recent interest in a variety of research fields, such as reinforcement learning (Eysenbach et al., 2018; Ghosh et al., 2018; Leibfried & Grau-Moya, 2019; Hihn et al., 2019; Arumugam et al., 2020), MCMC optimization (Hihn et al., 2018; Pang et al., 2020), meta-learning (Rothfuss et al., 2018; Hihn & Braun, 2020), continual learning (Nguyen et al., 2017; Ahn et al., 2019), and self-supervised learning (Thiam et al., 2021; Tsai et al., 2021).

The hierarchical structure we employ is a variant of the Mixture of Experts (MoE) model. Jacobs et al. (1991) introduced MoE as tree-structured models for complex classification and regression problems, where the underlying approach is the divide and conquer paradigm. As in our approach, three main building blocks define MoEs: gates, experts, and a probabilistic weighting to combine expert predictions. Learning proceeds by finding a soft partitioning of the input space and assigning partitions to experts performing well on the partition. MoEs for machine learning have seen a growing interest, with a recent surge stemming from the introduction of the sparsely-gated mixture-of-experts layer (Shazeer et al., 2017). In its initial form, this layer is optimized to divide the inputs equally among experts and then sparsely activate the top-k experts per input. This allowed training systems with billions of total parameters, as only a small subset was active at any given time. In our work, we removed the incentive to equally distribute inputs, as we aim to find specialized experts, which contradicts a balanced load. The computational advantage remains, as we still activate only the top-1 expert.

We extended the sparse MoE layer to continual learning by re-formulating its main principle as a hierarchical extension of variational continual learning (Nguyen et al., 2017). Our main contribution is removing the need for multi-headed networks by moving the head selection to the gating network. Our method is similar to the approach described in Hihn and Braun (2020) but differs in two key aspects. Firstly, we provide a more stable learning procedure as our layers can readily offer end-to-end training, which alleviates problems such as expert class imbalance and brittle converging properties reported in the previous study. Secondly, we implement the information-processing constraints on the parameters instead of the output of the experts, thus shifting the information cost from decision-making to learning. A method similar to ours is conditional computation for continual learning (Lin et al., 2019). The authors propose to condition the parameters of a neural network on the input samples by learning a (deterministic) function that groups inputs and maps a set of parameters to each group. Our approach differs in two main ways. First, our method can capture uncertainty, allowing us to learn stochastic tasks. Second, our design can incorporate up to \(M^n\) paths (or groupings) through a neural net with n layers and M experts per layer, making it more flexible than learning a mapping function. Another routing approach is routing networks by Collier et al. (2020). The authors propose to use mixture-of-experts layers that they train with a novel algorithm called co-training. In co-training, an additional data structure keeps track of all experts assigned to a specific task, as these are trained differently than those unassigned so far. In contrast, our method does not require additional training procedures and does not produce any organizational overhead. PathNet (Fernando et al., 2017) is a modular neural network architecture where an evolutionary algorithm combines modules (e.g., convolution, max pooling, activation) to solve each task. This approach requires two training procedures: one for the evolutionary algorithm and one to adapt the path modules. Our method allows efficient end-to-end training. Lee et al. (2020) propose a Mixture-of-Experts model for continual learning, in which the number of experts increases dynamically. Their method utilizes Dirichlet-Process-Mixtures (Antoniak, 1974) to infer the number of experts. The authors argue that since the gating mechanism is itself a classifier, training it in an online fashion would result in catastrophic forgetting. To remedy this, they implement a generative model per expert m to model \(p(m\vert x)\) and approximate the output as \(p(y\vert x) \approx \sum _m p(y\vert x,m)p(m\vert x)\). In our work, we have demonstrated that it is possible to implement a gating mechanism based only on the input by coupling it with an information-theoretic objective to prevent catastrophic forgetting.

To stabilize expert training we introduced a diversity objective. Diversity measures have witnessed increasing interest in the reinforcement learning community. The "diversity is all you need" (DIAYN) paradigm (Eysenbach et al., 2018) proposes to formulate an information-processing hierarchy similar to Eq. (8), where the information bottleneck on the latent variable discards irrelevant information while an entropy bonus increases diversity. DIAYN acts as intrinsic motivation in environments with no rewards and enables efficient policy learning. Lupu et al. (2021) investigate diversity as a means to create a set of policies in a multi-agent environment. They define diversity between two policies based on a generalization of the Jensen-Shannon divergence and optimize this objective and the main goal defined by the environment simultaneously. Parker-Holder et al. (2020) introduce a DPP-based method that aims to promote diversity in the policy space by defining it as a measure of the different states a policy may reach from a given starting state. Dai et al. (2021) propose another DPP-based method, where the idea is to augment the sampling process in hindsight experience replay (Andrychowicz et al., 2017) (a method for off-policy reinforcement learning) with a diversity bonus. The method we propose to encourage expert diversity differs from previous methods as we define diversity in parameter space instead of the policy outcomes or inputs. In combination with a \({{\,\mathrm{\text {D}_\text {KL}}\,}}\) penalty, this allows us to optimize for efficient information-processing, which facilitates efficient continual learning. Additionally, as we define it on parameters instead of actions, we can apply it straightforwardly to any problem formulation, as our extensive experiments show.

Currently, there are only a few methods that perform well in both supervised continual learning and continual reinforcement learning (e.g., Ahn et al., 2019; Jung et al., 2020; Cha et al., 2020). These methods require task information, as they either keep a set of separate task-specific heads (Ahn et al., 2019; Jung et al., 2020) or compute task-specific losses (Cha et al., 2020). We were able to achieve comparable results to uncertainty-based continual learning (UCL) (Ahn et al., 2019), which makes our method one of the first task-agnostic CL approaches to do so.

4.2 Critical issues and future work

One drawback of our method is the number of hyper-parameters it introduces. Specifically, the weighting factors for the gating and the expert \({{\,\mathrm{\text {D}_\text {KL}}\,}}\) constraint, the diversity bonus and its kernel parameters, and the number of experts and the top-k settings. To optimize them in our experiments, we ran a hyperparameter optimization algorithm on a reduced variant of the problem and fine-tuned them on the complete dataset. To further mitigate this problem, we used a scheduling scheme for the \({{\,\mathrm{\text {D}_\text {KL}}\,}}\) weights, as discussed in more detail in Sect. 3.4.4.

Moreover, variational inference in neural networks can be computationally expensive (Gal & Ghahramani, 2016; Zhang et al., 2018; Freitas et al., 2000), as one has to draw samples for each forward pass to approximate gradients. We tackle this issue in three ways. Firstly, we use \(D\)-dimensional Gaussian mean-field approximate posteriors \(q_t(\theta )= \prod ^D_{d=1}\mathcal {N}(\theta _{t,d}\vert \mu _{t,d},\sigma ^2_{t,d})\) to model distributions. This allows us to compute the \({{\,\mathrm{\text {D}_\text {KL}}\,}}\) (as well as the diversity measures) in closed form. Secondly, we use the flip-out estimator (Wen et al., 2018) to approximate the gradients, which is known to have a low variance. In practice, we draw a single sample to approximate the expectation. Lastly, our top-\(k\) sampling allows us to only activate k of the posteriors, which means that the remaining parameters have zero gradient and accordingly do not have to be updated (Tensorflow 2022). As a consequence, our HVCL approach requires a similar number of training episodes and a similar amount of computation time as non-probabilistic network representations.

We have observed in simple toy problems that expert allocation and optimization greatly depend on the initialization of the experts. A sufficiently diverse expert initialization yielded better results while requiring fewer iterations. We could not directly transfer this to more complex learning problems because this would introduce computational instabilities during back-propagation, as we lose the variance reducing benefits of state-of-the-art initialization schemes (Glorot et al., 2010; He et al., 2015; Narkhede et al., 2021). This is also part of the reason why we chose to implement a diversity objective instead of a novel initializer. We leave this as a topic for future work.

To optimize the diversity measure, we have to compute the determinant of the kernel matrix (see Sect. 2.2) by evaluating the kernel function for each expert pair and finding the derivative of its determinant. While we showed that the Wasserstein-2 distance between Gaussian distributions with diagonal covariances is available in closed form (see Eq. 13), and that the resulting kernel matrix is symmetric, there is still the issue of the derivative of the determinant. Following Jacobi’s formula to do so requires finding the inverse of the kernel matrix. This operation fails when the kernel matrix is not invertible, which is equivalent to the determinant being zero. This is the case if diversity is also zero, i.e., the expert parameters are pairwise nearly identical. We countered this by setting the kernel width h (see Eq. 12) to a sufficiently large value, which we found by a simple grid search. This problem requires further investigation, as it can impede the complete optimization process.

Our model requires a fixed number of experts. In a more realistic continual learning setting, the number of tasks may grow such that the system may benefit from additional experts. This could be realized by taking a Bayesian non-parametric approach which treats the number of experts as a variable. Dirichlet-Process-Mixtures (Antoniak, 1974) offer such flexibility, with recent applications to meta-learning (Jerfel et al., 2019), and to continual learning (Lee et al., 2020).

5 Conclusion

We introduced a novel hierarchical approach to task-agnostic continual learning, derived an immediate application, and extensively evaluated this method in supervised continual learning and continual reinforcement learning. The method we introduced builds on a hierarchical Bayesian inference view of continual learning and is a direct extension of Variational Continual Learning (VCL). This adaptive mechanism allowed us to remove the need for extrinsic task-relevant information and to operate in a task-agnostic way. While we removed this limitation, we achieved results competitive with both task-aware and task-agnostic algorithms. These insights also allowed us to design diversity objectives: we could show that enforcing expert diversity through additional terms stabilizes learning and further reduces the risk of catastrophic forgetting. Essentially, catastrophic forgetting is avoided by having multiple experts that are responsible for different tasks, are activated in different contexts, and are trained only with their assigned data, which avoids weight overwriting and negative interference. Finally, having more diverse experts leads to crisper task partitioning with less interference.

As our method builds on generic utility functions, we can apply it independently of the underlying optimization problem, which makes our method one of the first to do so and achieve competitive results in continual supervised learning benchmarks based on variants of the MNIST, CIFAR-10, and CIFAR-100 datasets. In continual reinforcement learning, we evaluated our method on a series of challenging simulated robotic control tasks. We also demonstrated how our method gives rise to a generative continual learning method. Additionally, we conducted ablation experiments to analyze our approach in a detailed and systematic way. These experiments confirmed that the additional objectives we introduced enhance expert partitioning and enforce a sparse expert selection policy. This leads to specialized and diverse experts, which alleviates catastrophic forgetting. We also investigated the impact of the hyper-parameters our method introduces to highlight how they help to mitigate catastrophic forgetting. In the generative setting, we took a closer look at the performance of the generative model by introducing a continual learning version of the widely used Inception Score. Finally, we showed that increasing the number of experts reduces forgetting in the continual reinforcement learning scenarios.