Introduction

It is crucial for animals to infer the identity of odors, in situations ranging from foraging to mating1. While some odors are hardwired2, most must be learned. Learning, however, is difficult, especially in natural environments, where odors are rarely presented in isolation, most odors are encountered only a handful of times, and odor identities are rarely supervised. Nevertheless, animals can learn to associate an odor with a reward in a few trials3,4,5. Our goal here is to elucidate the local plasticity mechanisms that orchestrate this rapid learning.

To gain a conceptual understanding of how learning occurs, note that if the affinities of olfactory receptor neurons (OSNs) to odors were known, approximate Bayesian inference could be used to infer which odors are present given OSN activity6. And in a supervised setting—a setting in which the animal is told which odors are present—the affinities (i.e. the weights) could be learned efficiently using recently proposed Bayesian approaches7,8. Here we show that, even when the weights are not known and learning is unsupervised, we can combine these two methods to simultaneously learn the weights and infer the odors.

Our approach is as follows: when inferring which odors are present, average over the uncertainty in the weights; then use the inferred odors to update the estimates of the weights, and, importantly, decrease the uncertainty. As the estimates of the weights become more accurate, inference also improves. However, while straightforward, exact implementation of this learning process is intractable. Consequently, we have to use an approximate method9.
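To illustrate the first half of this loop, the sketch below infers a single odor's concentration from Eq. (1) while averaging over a Gaussian belief about a single weight. All numbers are illustrative, and a flat prior on concentration is used in place of the sparse prior introduced later; marginalizing over the weight inflates the predictive variance, so the posterior over concentration is broad when the weight is uncertain and tightens as the weight is learned.

```python
import numpy as np

# Infer c from x = w * c + noise (Eq. 1), averaging over a Gaussian belief
# w ~ N(w_mean, w_var). Marginalizing over w gives
# x | c ~ N(w_mean * c, sigma_x^2 + w_var * c^2). Values are illustrative;
# the prior on c is taken flat here, unlike the sparse prior of the model.
sigma_x, w_mean, x = 0.1, 1.0, 1.2
c = np.linspace(0.0, 3.0, 601)

for w_var in (0.5, 0.01):                 # uncertain vs. well-learned weight
    var = sigma_x**2 + w_var * c**2       # predictive variance given c
    lik = np.exp(-0.5 * (x - w_mean * c)**2 / var) / np.sqrt(var)
    post = lik / lik.sum()                # normalized posterior on the grid
    mean = (c * post).sum()
    sd = np.sqrt(((c - mean)**2 * post).sum())
    print(f"w_var {w_var:>4}: posterior over c has mean {mean:.2f}, sd {sd:.2f}")
```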

Although inference is approximate, our model still leads to faster learning of olfactory stimuli compared to previously proposed sparse-coding-based approaches10,11,12. It also provides some insight into olfactory circuitry: it reveals the advantage, relative to the rectified linear transfer function13, of the sigmoidal-shaped fI curves typical of biological neurons14,15, and it reproduces the reduction in neuronal input gain16,17 and learning rate18 commonly observed during development. In addition, it predicts that the learning rate of granule cells should decrease as they become more selective, and thus exhibit lower lifetime sparseness19,20, something that is possible (although difficult) to test experimentally. And finally, we extended our model to an odor–reward association task, and found that learning a concentration-invariant representation in piriform cortex supports rapid odor–reward association.

While our approach gives us a model that is reasonably consistent with mammalian olfactory circuitry, the architecture predicted by our approximate Bayesian algorithm does not perfectly match the architecture of the olfactory system. However, a plausible olfactory circuit based on our model, but with the addition of recurrent inhibition among piriform neurons21, still learns to perform reward-based learning quickly. These results suggest that even at the circuit level, approximate Bayesian optimization may underlie rapid biological learning. But at the same time, our study reveals its limitation when applied to a complicated system.

Results

Problem setting

Let us denote odor concentrations by a vector c = (c1, . . . , cM), where cj > 0 if odor j is present and cj = 0 otherwise. By odor, we mean something like the odor of apple or coffee, not a single odorant molecule. In a typical environment, odors are very sparse, in the sense that few of them have a significant presence (i.e. cj > 0 for a small number of j at any time; Fig. 1 left).

Fig. 1: Problem setting.
figure 1

An example odor stimulus, c (left), and the response at the glomeruli, x (right). The mixing weights (i.e., affinities), w (which are unknown to the animal) map odors, with concentration c, to OSN activity accumulated at the glomeruli, x. A goal of the animal is to infer the odor concentrations from the glomeruli activity.

In the olfactory system, odors are first detected by OSNs, and then transmitted to glomeruli as spiking activity22. Neural activity accumulated at a glomerulus, denoted xi for the ith glomerulus (and thus the ith OSN receptor type), is approximately

$${x}_{i}={\sum }_{j}{w}_{ij}{c}_{j}+n,$$
(1)

where n is the noise due to sensory variability and unreliable OSN-spiking activity, and the affinity, or the mixing weight, wij, determines how strongly odor j activates glomerulus i (Fig. 1 right). OSN activity shows a roughly logarithmic dependence on odor concentration23,24. Thus the amplitude, cj, of each odor reflects log-concentration, not concentration. Below a threshold, here taken to be zero, odors are considered undetectable.
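For concreteness, here is a minimal sketch of this generative model for a single trial; the sparsity level, noise scale, and weight statistics are illustrative stand-ins for the exact values given in the Methods section (Eqs. (6)–(8)).

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 100, 400            # odors and glomeruli, as in the simulations below
c_o, sigma_x = 0.03, 0.1   # presence probability and noise scale (assumed here)

w = rng.lognormal(-np.log(c_o * M), 1.0, size=(N, M))       # affinities (cf. Eq. 7)
c = np.zeros(M)
present = rng.random(M) < c_o                               # few odors present
c[present] = rng.gamma(3.0, 1.0 / 3.0, size=present.sum())  # unit-mean amplitudes

x = w @ c + sigma_x * rng.standard_normal(N)                # Eq. (1)
```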

Olfactory learning as Bayesian inference

The goal of the early olfactory system is to infer which odors are present and what their concentrations are, based on OSN activity, x. However, this is a difficult problem because the animal does not know the mixing weights, w, but instead has to learn them, without supervision. One common approach to this type of unsupervised learning is the sparse coding model. Its associated learning algorithm is, however, inefficient, and thus slow, as we will see below (see the subsection “Sparse coding” in the Methods section). We thus turn to Bayesian inference.

The Bayesian approach is efficient because it takes into account uncertainty in both the odors, c, and the weights, w, and it can naturally incorporate a prior that reflects the sparseness of the olfactory environment. The steps are straightforward: first write down, from Eq. (1), an expression for p(c∣x, w), the distribution over odor concentrations given glomeruli activity, x, and weights, w; then marginalize over the distribution of the weights given all the previous inputs, p(w∣ past observations of x) (see Methods section, Eq. (10)). However, exact marginalization is neither computationally tractable nor biologically plausible. We therefore employ a variational Bayesian approximation9, replacing the true joint probability distribution with a fully factorized one. The effect of making a variational approximation is illustrated in Fig. 2c: the true posterior over a pair of odors is typically slightly anti-correlated (Fig. 2c, left), while the variational distribution is independent (Fig. 2c, right). Because the anti-correlation is typically weak, the variational distribution captures the true distribution well.

Fig. 2: Bayesian inference of odors and weights.
figure 2

a Inference of odor concentration. Combining the likelihood q(x∣c) (left) and the prior pc(c) (middle), the posterior distribution q(c∣x) is obtained (right). The orange dashed line is the mean concentration associated with the likelihood, q(x∣c); the black dashed line is the mean associated with the posterior, q(c∣x). Because the prior strongly favors the absence of odors, the latter is shifted to lower concentration. b Illustration of the weight update given the same sensory evidence Δqt(w, x) when the previously estimated probability distribution over the weights, qt−1(w), is broad (left), and narrow (right). Note that the mean of qt−1(w) is the same in both panels. c Illustration of the variational approximation. The true posterior over the joint distribution of odors c1 and c2, p(c1, c2∣x) (left), is approximated by a factorized distribution q(c1∣x)q(c2∣x) (right). The black cross indicates the true concentrations, and colored lines are contours of equal probability.

The derivation of the algorithm for variational inference is described in detail in the Methods section; here we simply give the results. The variational probability distribution of the concentration of odor j is updated iteratively as (see Methods section, Eq. (14b))

$$q({c}_{j}| {\bf{x}})\propto q({\bf{x}}| {c}_{j}){p}_{{\mathrm{{c}}}}({c}_{j})$$
(2)

where q(x∣cj) is the variational likelihood of the concentration of the jth odor, cj, given x, and pc(cj) is the prior distribution over cj. We take the noise, n, in Eq. (1) to be Gaussian, so q(x∣cj) is Gaussian (Fig. 2a, left). And to reflect the sparsity, pc(cj) is taken to be a point mass at zero combined with a continuous piece at positive concentration (Fig. 2a, middle). Because the prior strongly favors the absence of odors, the estimated mean concentration, 〈c〉q(c∣x) (dashed black line in Fig. 2a, right), is typically smaller than the mean over the likelihood function, 〈c〉q(x∣c) (dashed orange line in Fig. 2a, right).
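The sketch below evaluates Eq. (2) on a grid for a single odor, combining a Gaussian likelihood (with an illustrative mean and precision) with the spike-and-slab prior of Eq. (8) in Methods; it reproduces the shrinkage toward zero shown in Fig. 2a.

```python
import numpy as np
from math import gamma

c_o, alpha = 0.03, 3.0       # prior sparsity and slab shape (Methods, Eq. 8)
mu, lam = 0.8, 4.0           # mean and precision of q(x|c) (illustrative)

c = np.linspace(1e-6, 4.0, 4000)
dc = c[1] - c[0]
lik = np.exp(-0.5 * lam * (c - mu)**2)                        # Gaussian likelihood
slab = c_o * alpha**alpha / gamma(alpha) * c**(alpha - 1) * np.exp(-alpha * c)
spike = (1 - c_o) * np.exp(-0.5 * lam * mu**2)                # point mass at c = 0
Z = spike + (lik * slab).sum() * dc
c_mean = (c * lik * slab).sum() * dc / Z                      # posterior mean
print(f"likelihood mean {mu}, posterior mean {c_mean:.3f}")   # shrunk toward zero
```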

Similarly, the update rule for the variational probability distribution of a weight is given by (see Methods section, Eq. (14a))

$${q}_{t}({w}_{ij})\propto \Delta {q}_{t}({w}_{ij},{\bf{x}}){q}_{t-1}({w}_{ij}),$$
(3)

where Δqt(wij, x) is the evidence provided by the new information, carried in x, at trial t (Fig. 2b) and qt(wij) is the variational probability distribution of the weight, wij, given observations up to trial t (we suppress the time dependence to reduce clutter). Importantly, depending on the uncertainty in the weights, the same stimulus causes different amounts of plasticity. In particular, the higher the uncertainty in the estimated weight, wij, at t−1, the larger the change in the mean weight, Δw (left vs. right in Fig. 2b).
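Since both factors in Eq. (3) are Gaussian in the weight, the update is just a product of Gaussians; the toy numbers below reproduce the effect in Fig. 2b: identical evidence moves a broad prior much further than a narrow one.

```python
import numpy as np

def gaussian_update(prior_mean, prior_var, evidence_mean, evidence_var):
    """Combine a Gaussian prior with Gaussian evidence (product of Gaussians)."""
    post_var = 1.0 / (1.0 / prior_var + 1.0 / evidence_var)
    post_mean = post_var * (prior_mean / prior_var + evidence_mean / evidence_var)
    return post_mean, post_var

# Same evidence, different prior uncertainty (cf. Fig. 2b; numbers illustrative).
evidence = (1.0, 0.5)                  # mean and variance of Delta-q_t(w, x)
for prior_var in (2.0, 0.1):           # broad vs. narrow q_{t-1}(w)
    m, v = gaussian_update(0.0, prior_var, *evidence)
    print(f"prior var {prior_var:>4}: posterior mean shift {m:.3f}, var {v:.3f}")
```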

The update rules given in Eqs. (2) and (3) can be mapped onto neural dynamics and synaptic plasticity that closely mirror the circuitry of the mammalian olfactory bulb (Fig. 3a and b). The firing rate dynamics obeys

$${\tau }_{{\mathrm{{r}}}}\frac{{\mathrm{{d}}}{m}_{i}}{{\mathrm{{d}}}\tau }=-{m}_{i}-\sum_{j = 1}^{M}{w}_{ij}^{{\rm{L}}}{\overline{c}}_{j}+{x}_{i}$$
(4a)
$${\tau }_{{\mathrm{{r}}}}\frac{{\mathrm{{d}}}{\overline{c}}_{j}}{{\mathrm{{d}}}\tau }=-{\overline{c}}_{j}+{F}_{j}\left(\sum_{i = 1}^{N}{w}_{ji}^{{\rm{F}}}{m}_{i}\right)$$
(4b)

where τ denotes time within an odor presentation (not to be confused with t, which refers to trial), mi is the firing rate of the ith M/T (mitral/tufted) cell relative to baseline, and \({\overline{c}}_{j}\) is the firing rate of the jth granule cell. The ith M/T cell is linearly modulated by excitatory input from glomerulus i, via xi, and also by inhibitory input from granule cells, the \({\overline{c}}_{j}\). The granule cells, whose activity corresponds to the expected concentration of the odors, are driven by excitatory input from M/T cells, mediated by a nonlinear transfer function Fj. As we discuss below, this nonlinearity plays a critical role in rapid learning.
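A minimal Euler integration of Eq. (4) is sketched below. The transfer function is a smoothed threshold-linear stand-in for the Bayes-optimal Fj of Eq. (32) in Methods, and the weights, input, and time constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
M, N = 50, 400                    # odors and glomeruli, as in Fig. 3c
tau_r, dt = 0.05, 0.001           # time constant and Euler step (illustrative)

def F(y):
    # Smoothed threshold-linear stand-in for the Bayes-optimal F of Eq. (32):
    # near zero below threshold, approximately linear above it.
    return np.log1p(np.exp(10.0 * (y - 0.5))) / 10.0

wF = rng.lognormal(-np.log(0.03 * M), 1.0, size=(M, N)) / N   # M/T -> granule
wL = wF.T.copy()                  # granule -> M/T (ideally reciprocal)
x = rng.random(N)                 # glomerular input for one presentation

m, cbar = np.zeros(N), np.zeros(M)
for _ in range(int(0.5 / dt)):    # ~500 ms, roughly one sniff cycle
    dm = -m - wL @ cbar + x                   # Eq. (4a)
    dcbar = -cbar + F(wF @ m)                 # Eq. (4b)
    m, cbar = m + dt / tau_r * dm, cbar + dt / tau_r * dcbar
```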

Fig. 3: Neural implementation of Bayesian learning.
figure 3

a Schematic of the neural architecture. Dotted box represents the internal variables of the brain; the odor, c, comes from the outside world. b The neural implementation of our Bayesian learning model maps almost perfectly onto the circuitry of the olfactory bulb. Dotted circles are glomeruli, green triangles are M/T cells, and blue circles represent olfactory granule cells. Red and blue arrows indicate weights from granule to M/T and M/T to granule cells, respectively. c An example of firing rate dynamics before (left) and after (right) learning (M = 50 odors, N = 400 glomeruli, four odors presented). Different colors represent different neurons. Dotted horizontal lines in the bottom figures represent the true concentrations of the presented odors. d Change in the variance of M/T cell activity during learning (t: trial). The expectation was taken over both population and trials. e Receiver operating characteristic (ROC) curves under different numbers of simultaneously presented odors (M = 100 odors, N = 400 glomeruli). See subsection “ROC curve” in the Methods section for details. f Performance under learning from various odor exposure durations (see subsection “Performance evaluation” in the Methods section), where M = 100, N = 400, and three odors are presented simultaneously, on average. The lines and their error bars are mean and standard deviation over 10 simulations.

The weights in Eq. (4), \({w}_{ij}^{{\rm{F}}}\) and \({w}_{ij}^{{\rm{L}}}\), correspond to M/T-to-granule and granule-to-M/T synapses, respectively (blue and red arrows in Fig. 3b). These synapses jointly form a dendro-dendritic connection between M/T and granule cells25. To keep track of the variational probability distribution qt(wij), both the mean and the variance of each weight need to be updated. The update of the mean is

$${w}_{ji}^{{\mathrm{{F}}},t}=(1-{\delta }_{j}^{w,t}){w}_{ji}^{{\mathrm{{F}}},t-1}+\frac{1/t}{{\rho }_{j}^{t}{\sigma }_{x}^{2}}{\overline{c}}_{j}{m}_{i}$$
(5a)
$${w}_{ij}^{{\mathrm{{L}}},t}=(1-{\delta }_{j}^{w,t}){w}_{ij}^{{\mathrm{{L}}},t-1}+\frac{1/t}{{\rho }_{j}^{t}{\sigma }_{x}^{2}}{m}_{i}{\overline{c}}_{j}$$
(5b)

where mi and \({\overline{c}}_{j}\) are evaluated at the end of the odor presentation. Here \({\delta }_{j}^{w,t}\) is the discount factor and \({\rho }_{j}^{t}\) represents the precision (the inverse of the variance) of the synaptic weights \({w}_{ji}^{{\mathrm{{F}}},t}\) and \({w}_{ij}^{{\mathrm{{L}}},t}\) (see subsection “Synaptic plasticity” in the Methods section for details). This rule is Hebbian, as the update depends on the product of presynaptic and postsynaptic activity mi and \({\overline{c}}_{j}\). It is also adaptive, as the update depends on the precision, \({\rho }_{j}^{t}\): because of the \(1/{\rho }_{j}^{t}\) dependence, low precision (and thus high uncertainty) produces large weight changes while high precision (and thus low uncertainty) produces small weight changes. This is illustrated in Fig. 2b. The precision, \({\rho }_{j}^{t}\), is also updated in an activity-dependent manner (see the Methods section, Eq. (35)). Figure 3c describes typical neural dynamics before and after learning. Before learning, when a mix of four odors is presented, M/T activity quickly converges to constant values with a relatively broad range (Fig. 3c, top-left), and granule cell activity is small and homogeneous (Fig. 3c, bottom-left). After learning, M/T cells exhibit transient activity, followed by convergence to a somewhat smaller range than before learning (Fig. 3c, top-right), as the large input-driven activity is partially canceled by the feedback from the granule cells. Granule cells, on the other hand, show very selective responses, with activity levels roughly matching the concentration of the corresponding odors (Fig. 3c, bottom-right).
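To make the rule concrete, here is a sketch of one plasticity step that combines Eq. (5) with the precision update (Eq. (27a) in Methods); the discount factor \({\delta }_{j}^{w,t}\) is written out explicitly from Eq. (27b), the noise variance is an illustrative default, and \(\langle {c}_{j}^{2}\rangle\) would be supplied by Eq. (33).

```python
import numpy as np

def plasticity_step(wF, wL, rho, m, cbar, c2, t, sigma_x2=0.01):
    """One trial of Eqs. (5)/(27). m: M/T rates (N,); cbar: granule rates,
    i.e. <c_j> (M,); c2: second moments <c_j^2> (M,); rho: precisions (M,)."""
    rho_new = (1 - 1 / t) * rho + (1 / t) * c2 / sigma_x2          # Eq. (27a)
    lr = (1 / t) / (rho_new * sigma_x2)                            # adaptive rate
    keep = ((1 - 1 / t) * rho + (1 / t) * cbar**2 / sigma_x2) / rho_new  # 1 - delta
    wF_new = keep[:, None] * wF + lr[:, None] * np.outer(cbar, m)  # Eq. (5a)
    wL_new = keep[None, :] * wL + lr[None, :] * np.outer(m, cbar)  # Eq. (5b)
    return wF_new, wL_new, rho_new
```

Because the learning rate scales as \(1/(t{\rho }_{j}^{t})\), granule cells with uncertain weights (low precision) change their weights the most, exactly the behavior illustrated in Fig. 2b.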

The activity profiles of cells in our model have many similarities with experimental observations. For instance, as observed in experiments26, M/T cells show both positive and negative responses relative to baseline (Fig. 3c top, here the baseline is 5), and their responses become more transient after learning (Fig. 3c, top-right, and Fig. 3d). Moreover, the response range of M/T cells becomes smaller as the animal learns the odors (Fig. 3d), as observed experimentally27. In addition, after learning, granule cell activity is strongly modulated by odor concentration (Fig. 3c bottom-right; dotted horizontal lines represent the true concentrations of the corresponding odors), as observed experimentally28.

After learning, the circuit can robustly detect odors with very few false positives, even when several odors are presented simultaneously (Fig. 3e). Moreover, the learning performance was robust with respect to odor presentation time: even if the odors were presented for only a few hundred milliseconds, which corresponds roughly to one sniff cycle29,30, performance remained high (Fig. 3f). Learning was also robust to changes in the prior: a large increase in the range of possible odor concentrations had very little effect on learning performance (Supplementary Fig. 1).

The Bayesian approach is optimal if implemented exactly, but in the approximate model used here, learning is necessarily suboptimal. To determine how suboptimal, we would need to compare against exact inference. However, that is not feasible because exact inference is intractable. Our model does, however, do better than the sparse coding model (Fig. 4): it learns much faster (Fig. 4a), and it achieves high performance without fine tuning, whereas the learning rate of the sparse coding model must be fine-tuned (gray lines in Fig. 4a). This advantage was replicated when we assessed the performance by the error in the weights (Fig. 4b). Despite faster learning, the asymptotic performance of the Bayesian model is similar to that of sparse coding when there are a relatively small number of odor sources in the environment, and much better when there are many sources, although the performance of both models deteriorates in that regime (Fig. 4c).

Fig. 4: Performance comparison.
figure 4

a Learning curves for our model (orange) and sparse coding (light gray to black). M = 100 odors, N = 400 glomeruli, and on average, three odors were presented at each trial. See subsection “Performance evaluation” in the Methods section for details. The learning rates of the sparse coding model, ηw, were 0.3, 0.5, and 1.0 from light gray to black. b Same as a, but performance was evaluated by the error in the weights. c Performance (after learning from 4000 trials) of the proposed Bayesian model (orange) and the sparse coding model (gray) versus the number of odors. Shaded regions represent standard deviation over 10 simulations. As in panels a and b, N = 400 glomeruli and three odors were presented on average. Here, ηw was fixed at 0.5.

These results indicate that a variational approximation of Bayesian learning and inference enables data-efficient learning, and does so using biologically plausible learning rules and neural dynamics. How does our model manage to perform fast and robust learning? And is there evidence that the brain uses this strategy? Below, we show that our proposed circuit performs well because it exploits the sparseness of the odors and utilizes the uncertainty in both the weights and the odor concentrations. We then discuss the relationship of our model to experimental observations.

The sparse prior leads to a nonlinear transfer function

An important feature of olfaction, like many real world inference problems, is that the distribution over odors has a mix of discrete and continuous components: an odor may or may not be present (the discrete part), and if it is present its concentration can take on a range of values (the continuous part). In our model, we formalize this with a spike and slab prior (Fig. 2a middle): the spike is the delta function at zero; the slab is the continuous part. In this model, sparseness is ensured by setting the cumulative probability of the slab, denoted co, close to zero.

To see how the prior affects the dynamics, note that the granule cells (\({\overline{c}}_{j}\) in Eq. (4)) represent the expected concentration of the odors, and so take the prior into account. Thus, after learning, most of them have near zero activity, with only a few of them active (Fig. 3c, bottom right panel). To achieve sparsity, the granule cells need a great deal of evidence to report non-negligible concentrations. That is reflected in the transfer functions of the granule cells (the function Fj in Eq. (4b); see orange curve in Fig. 5a). The function exhibits near zero response (corresponding to near zero concentration) for small input, followed by a sharp rise and then an approximately linear response for large input.

Fig. 5: Adaptive transfer functions.
figure 5

a The shapes of the transfer functions of granule cells under different priors on the odor distribution. See subsection “Models with various priors on odor concentration” in the Methods section for the details. b Weight errors under different priors. Shaded regions represent standard deviation over 10 simulations. c The average transfer function \(F[y,\overline{c}=0]\) at the beginning (light gray), middle (gray), and the end (black) of the learning. The x-axis represents the input current y. d The weight error under fixed input gain, compared to the control model with adaptive gain, averaged over 50 simulations. For the gray line, the transfer function was set to the top curve in panel c; for the black line it was set to the bottom curve. In all panels, M = 100 odors, N = 400 glomeruli, and three odors were presented on average.

If we derive update rules using a different prior, the transfer function changes. If we then perform inference and learning using a transfer function derived under the wrong prior, while drawing odors from the true prior, performance is, not surprisingly, sub-optimal (see subsection “Models with various priors on odor concentration” in the Methods section). For example, if we constrain the odors only to be non-negative, the transfer functions are approximately rectified linear, a commonly used nonlinearity in artificial neural networks13 (gray line in Fig. 5a). However, this model failed to learn the input structure generated from the spike-and-slab prior, as it does not take the sparseness of the odor concentrations into account (gray line in Fig. 5b). If we constrain the odors to be non-negative, but also ensure that they are not too large, by introducing an exponential decay10, learning improves initially, but the weight error eventually increases (black lines in Fig. 5a and b). These results suggest that the classic input–output function—sigmoidal at small input and linear at large input—found both in vitro14,31 and in biophysically realistic models of neurons15, reflects the fact that the world is truly sparse—something not captured by classical sparse coding models. These gain functions thus offer a normative explanation for the biophysical responses of typical olfactory neurons to input. The shape of the activation function for the precision update also depends on the choice of prior, but in all cases it closely resembles the squared transfer function, F2 (Supplementary Fig. 2).

As the animal learns a better approximation to the true weights, the olfactory system can extract more information from the OSN activity; this results in a change in the transfer function. In particular, the transfer function exhibits a decrease in gain with learning (mainly a shift to the right), as shown in Fig. 5c (see subsection “The variational weight distribution” in the Methods section for details). Such a decrease in gain is widely observed among diverse neurons during development14,16. It is also consistent with the reduction of input resistance observed in adult-born granule cells during development17,18, as low resistance causes low excitability. When the transfer functions were held fixed during learning, performance deteriorated gradually (gray and black curves vs. orange line in Fig. 5d), though the benefit of the adaptive gain was rather small in our model setting.

Weight uncertainty leads to adaptive synaptic plasticity

A key aspect of our model is that it explicitly takes the uncertainty of the weights into account. This leads to an adaptive learning rate (see Eq. (5)). In particular, the learning rate is the product of two terms: \((1/t)\times 1/{\rho }_{j}^{t}\). The first term, 1/t, is a global decay, and reflects an accumulation of information over time: at the beginning of learning, each olfactory stimulus contains a relatively large amount of information about the weights, and so the learning rate is large; later in learning the reverse is true. The second term, \(1/{\rho }_{j}^{t}\), is the cell-specific contribution to the learning rate. In steady state, it is given approximately by \(1/{\rho }_{j}^{t}\propto 1/{\langle {c}_{j}^{2}\rangle }_{{\rm{odors}}}\) (the subscript “odors” indicates an average over odors).

It turns out that the second term is related to the lifetime sparseness, \({S}_{j}\equiv {\langle {c}_{j}\rangle }_{{\rm{odors}}}^{2}/{\langle {c}_{j}^{2}\rangle }_{{\rm{odors}}}\) (note that smaller Sj means activity is more sparse; see subsection “Lifetime sparseness” in the Methods section and ref. 19). If the mean firing rate, \({\langle {c}_{j}\rangle }_{{\rm{odors}}}\), is approximately constant (as it is in our simulations), then \(1/{\rho }_{j}^{t}\propto {S}_{j}\). When the granule cells have broad, non-selective tuning, the lifetime sparseness is large, and the learning rate is high; when the cells are sparse and have highly selective tuning, the lifetime sparseness is low, and so is the learning rate. Thus, if the mean granule cell responses are similar for all presented odors, the learning rate is large, encouraging neurons to modify their selectivity. If, on the other hand, the granule cell responses are sparse and selective, the learning rate is low, helping the neurons stabilize their acquired selectivity.
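The relation is easy to see numerically: below, lifetime sparseness is computed for a hypothetical broadly tuned cell and a hypothetical selective cell, with the learning rate taken proportional to Sj as derived above; the response distributions are invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def lifetime_sparseness(r):
    """S = <r>^2 / <r^2> over odor presentations (smaller = more sparse)."""
    return r.mean() ** 2 / (r ** 2).mean()

broad = rng.gamma(2.0, 0.5, size=10000)              # responds to everything
selective = np.where(rng.random(10000) < 0.05,       # rare, strong responses
                     rng.gamma(2.0, 5.0, size=10000), 0.0)
for name, r in [("broad", broad), ("selective", selective)]:
    print(f"{name:>9}: S = {lifetime_sparseness(r):.3f} -> learning rate ~ S")
```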

We examined the effects of the two factors—1/t and \(1/{\rho }_{j}^{t}\)—on learning. When the learning rate, \(1/t{\rho }_{j}^{t}\), was kept constant throughout learning, learning was slower, even when the learning rate was finely tuned (gray lines vs. orange line in Fig. 6a). This makes sense from a Bayesian perspective: early on, when weight uncertainty is large, learning should be fast (the dark gray line, which has the highest learning rate, drops rapidly), whereas after a large number of trials, when weight uncertainty is low, learning should be slow (the lighter gray lines, which have lower learning rates, have better asymptotic performance). It is also consistent with the fine tuning required for the sparse coding model in Fig. 4a and b. When we fixed 1/ρj but included the global factor 1/t, performance was better than the model with a fixed learning rate (light-green vs. gray in Fig. 6a), yet still worse than the original fully adaptive model (light-green vs. orange in Fig. 6a). This was clearer under a less sparse setting (co = 0.07 in Fig. 6b, versus co = 0.03 in Fig. 6a). Furthermore, as predicted, we found that the learning rate of a cell, \(1/t{\rho }_{j}^{t}\), is positively correlated with the lifetime sparseness at each time point (i.e. at fixed t), as shown in Fig. 6c and d. This correlation becomes weaker as the prior becomes more sparse (compare Fig. 6c and d, for which co = 0.03 and 0.07, respectively). That is because a very sparse prior (low co) helps the granule cells to be highly selective at an early stage, enabling the lifetime sparseness to quickly converge to a small value (vertical cluster on the left edge of Fig. 6c and d). These results indicate that the global and postsynaptic-neuron-specific adaptation of the learning rate cooperatively support fast learning.

Fig. 6: Adaptive synaptic plasticity.
figure 6

a Weight error when \(1/t{\rho }_{j}^{t}\) is fixed (gray lines), \({\rho }_{j}^{t}\) is fixed (light green), and fully adaptive (orange). For the gray lines we used learning rates of 0.01, 0.1, and 1.0, corresponding to light gray through dark gray. The sparsity, co, was 0.03. b Same as panel a, but with a lower sparsity, co = 0.07. c, d Correlations between the lifetime sparseness and the learning rate, after 300 stimuli were presented to the network, under more sparse (c: co = 0.03) and less sparse (d: co = 0.07) conditions. Lines are linear regressions, and each dot represents one granule cell. Correlations were significant for both c and d (p ≪ 10−6). Vertical clusters appearing on the left edges of the panels correspond to neurons with very small lifetime sparseness. In all panels, M = 100 odors, N = 400 glomeruli, and 3 (a, c) or 7 (b, d) odors were presented on average. Light-green and orange lines in a and b are means over 50 simulations, while the rest were calculated from 10 simulations.

Learning concentration invariant representation and valence

Our results so far indicate that olfactory learning is well characterized as an approximate Bayesian learning process. Our circuit estimates odor concentration, which is important for locating an odor source32. However, the perceived concentration depends on factors such as the distance from the odor source, its size, and the wind speed, so odor concentration is not a reliable indicator of the expected reward. Acquisition of a concentration-invariant representation is therefore highly useful for many olfactory-guided behaviors.

A concentration-invariant representation is essentially a representation of the probability that an odor is present, denoted \({\overline{p}}_{j}\). Because of the spike in our prior, \({\overline{p}}_{j}=\Pr [{c}_{j}> 0]\), and this probability is easily decoded from M/T cells using the circuit depicted in Fig. 7a (see subsection “Learning of concentration-invariant representation” in the Methods section). Here, \({\overline{p}}_{j}\) could be represented by neurons in layer 2 of piriform cortex, as that is the main downstream target of M/T cells, and the odor representation in piriform cortex is approximately concentration-invariant21,33. As the granule cells acquire an odor representation, neurons in piriform cortex acquire an odor probability representation (cyan and dark blue lines in Fig. 7e, left).

Fig. 7: Learning a concentration invariant representation and an odor-reward association.
figure 7

a–d A set of increasingly realistic decoding models. a The decoding model associated with our variational Bayesian inference algorithm. Note that the weights need to be copied from wF to wp, something that is not biologically plausible. b Similar circuit, but with the mapping from m to \(\overline{p}\) learned via a local rule. c Same as b, but with lateral inhibition. d Same as c, but with feedback to the granule cells. e Learning performance for the models in a–d when decoding from granule cells (cyan) or piriform cortex (dark blue; see subsection “Odor estimation performance” in the Methods section). f Comparison of performance for model c (gray) and d (orange). Mean and standard deviation over 10 simulations are plotted. g Mean and standard deviation of responses of the granule cells, \(\bar{c}\), and the piriform neurons, \(\bar{p}\), to their selective odors presented at various concentrations. The responses were measured by presenting each odor in isolation at different concentrations, and then averaging over populations. h Schematic of the reward prediction circuit utilizing the concentration-invariant representation in the piriform cells, \(\bar{p}\). i Direct reward prediction from neural activity at the glomeruli. j Performance of odor–reward association measured by the classification performance (left) and the mean-squared error between the predicted reward and the actual reward (right) for the models in panels h (magenta) and i (purple). Lines are means over 100 simulations. k The mean response of neuron ep given an odor associated with the reward. The vertical line at τ = 2.5 s represents the reward presentation, and the dotted horizontal line is the sign-flipped reward value (−R). Different colors represent the different concentrations of the presented odor, from purple (c ≈ 0.1) to yellow (c ≈ 2.0). In all panels, M = 50 odors, N = 200 glomeruli, and three odors were presented on average, except for the go/no go task, where one of two selected odors was presented randomly.

While the circuit shown in Fig. 7a exhibits good performance, it is inconsistent with the mammalian olfactory system in two ways. First, the weights from the M/T cells to the granule cells have to be copied to the corresponding M/T to piriform cortex connections (i.e. wp = wF), something that is not biologically plausible. Second, a direct projection from granule cells to piriform cortex is needed, but such a connection does not exist. These inconsistencies can be circumvented by modifying the circuit heuristically (Fig. 7b–d). Weight copying can be avoided by learning wp with local synaptic plasticity (Fig. 7b), although in the absence of the teaching signal from the granule cells, this naive extension does not work (dark blue line in Fig. 7e, middle-left). However, introducing lateral inhibition among the piriform neurons (Fig. 7c), as observed experimentally21, allows the piriform neurons to acquire an odor representation (Fig. 7c and e, middle-right), although the decoding performance was worse than that of the Bayesian model (Fig. 7e, left vs. middle-right). Finally, if connections from piriform cells to granule cells are added as well, the learning performance of the granule cells becomes slightly better (Fig. 7d and e, right), and more robust to changes in the strength of lateral inhibition (Fig. 7f). As expected, the responses of piriform neurons were mostly concentration-invariant (dark blue line in Fig. 7g), whereas granule cells showed a clear concentration dependence (cyan line in Fig. 7g). Thus, the architecture of the mammalian olfactory circuit indeed supports robust learning of a concentration-invariant representation.

Once the circuit acquires a concentration-invariant representation, a circuit that performs odor–reward association can be constructed simply by taking the circuit depicted in Fig. 7d and adding a region that receives input from both piriform neurons and the reward system (ep in Fig. 7h). The olfactory tubercle could be the site of this odor–reward association5,34, but it could be other regions, such as layer 3 of piriform cortex, as well. To test the performance of this circuit, we implemented a go/no go task in which one odor is associated with a reward (R = 1.0), while another odor is associated with no reward (R = 0.0), regardless of concentration. We simulated this task by randomly presenting the rewarded or unrewarded stimulus with equal probability (see subsection “Go/no go task” in the Methods section). We used the circuit pre-trained with a large number of odors but without reward. When the reward prediction was learned through the projection from piriform cells, \(\overline{p}\), to olfactory tubercle cells, ep (Fig. 7h), classification performance reached 90% after just six trials (Fig. 7j; magenta lines). On the other hand, when the circuit learned the task directly from the glomeruli (Fig. 7i), it still learned to predict the reward, as suggested previously35, but learning was much slower and performance remained worse even after a large amount of training (Fig. 7j; purple lines). After a dozen odor–reward pairings, the input from piriform neurons, \(\overline{p}\), drove the olfactory tubercle cells, ep, to represent the reward prediction given olfactory stimuli, unless the concentration was very small (left half of Fig. 7k; in our model, ep is the reward prediction); once the reward was presented at τ = 2.5 s, the activity went back to near zero (right half of Fig. 7k; in our model, positive ep represents an error, and so drives learning).
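A sketch of the final association step is given below as a trial-level delta rule from \(\overline{p}\) to ep; the piriform responses are idealized as already concentration-invariant, and the rule's exact form, within-trial time course, and parameters are simplifications of the Methods (subsection “Go/no go task”).

```python
import numpy as np

rng = np.random.default_rng(0)
M, eta = 50, 0.5
v = np.zeros(M)                         # pbar -> e_p weights (hypothetical)

def pbar_response(odor):
    """Idealized piriform response: ~1 for the presented odor, regardless of
    concentration (Fig. 7g), plus small background activity."""
    p = 0.05 * rng.random(M)
    p[odor] = 1.0
    return p

for trial in range(20):
    odor = rng.integers(2)              # odor 0 rewarded, odor 1 not
    R = 1.0 if odor == 0 else 0.0
    p = pbar_response(odor)
    error = R - v @ p                   # prediction error, cf. e_p in Fig. 7h
    v += eta * error * p                # delta rule on the readout

print(f"prediction for rewarded odor: {v @ pbar_response(0):.2f}, "
      f"unrewarded: {v @ pbar_response(1):.2f}")
```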

These results indicate that unsupervised learning of odor representations may underlie fast reward-based learning, and the proposed Bayesian learning mechanism improves reward association by enabling robust odor representations in a data-efficient way.

Discussion

We formulated unsupervised olfactory learning in the mammalian olfactory system as a Bayesian optimization problem, then derived a set of local synaptic plasticity rules and neural dynamics that implemented Bayesian inference (Figs. 2 and 3). Our theory provides a normative explanation of the functional roles for the nonlinear transfer function and the developmental adaptation of the neuronal input gain (Fig. 5), both widely observed among sensory neurons. The model also predicts that the learning rate of dendro-dendritic connections should be approximately linear in the lifetime sparseness of the corresponding granule cells (Fig. 6). Finally, we extended the framework to learning of odor identity by piriform cortex, and showed that such learning supports rapid reward association (Fig. 7).

Our results suggest that adaptation of both input gain (Fig. 5) and learning rate (Fig. 6) are important for successful learning. The developmental reduction in input gain can be explained by a decrease in neural excitability, which is partially caused by the increased expression of K+ channels14. Correspondingly, it is known that changes in channel expression at the dendrite modulate the sensitivity of synaptic plasticity36. In particular, it has been reported that elimination of voltage-gated K+ channels enhances the induction of long-term potentiation37. These results suggest that developmental up-regulation of K+ channel expression at the soma and the dendrite may underlie the adaptation of the input gain and learning rate.

The cellular plasticity rules we derived explain multiple developmental changes in adult-born granule cells. Experimentally, relative to young cells, mature granule cells have sparser selectivity20, lower membrane resistance17,18, and are less plastic18, as predicted by our model. In addition, our results provide insight into the functional role of adult neurogenesis. As shown previously8, if each synapse keeps track of its uncertainty, by removing the most uncertain synapses while adding synapses at random positions on the dendritic tree, a neuron can achieve sample-based Bayesian learning, making neurogenesis unnecessary. However, in our unsupervised learning framework, uncertainty is defined at the level of neurons, not synapses. As a result, from a Bayesian perspective, there is no good way to perform synaptogenesis. Thus, the brain should instead remove the most uncertain neurons, while at the same time randomly adding new ones.

The importance of the feedback circuit between M/T cells and granule cells has been noted previously6,38, but plasticity mechanisms that generate this circuit have not been considered. Recently, several groups proposed learning algorithms for unsupervised olfactory learning using stochastic gradient descent11,12,39, as in the case of our sparse coding model. However, as we have seen (Fig. 4), these algorithms are very unlikely to be fast. In addition to the sparse coding model, our problem setting is deeply related to independent component analysis (ICA)40. Indeed, by using sparseness as the measure of non-Gaussianity, unsupervised olfactory learning can be reformulated as an ICA problem11.

The spike-and-slab prior employed here is widely used in machine learning41, and has been applied to the sparse coding model of the early visual system42, and a normative analysis of nonlinear transfer functions has been carried out previously43. A contribution of this work is the establishment of a link between the spike-and-slab prior and nonlinear transfer function of a neuron.

Studies of adaptive learning rates date back many decades44,45; more recent studies have taken a Bayesian approach to adaptive learning in simplified single-neuron models7. In this study, we considered an unsupervised learning problem, and showed that the learning rate of excitatory feedforward connections should depend only on the postsynaptic activity, independent of the presynaptic activity. Moreover, our theory predicted a non-trivial relationship between the learning rate and the lifetime sparseness of the postsynaptic neuron (Fig. 6c and d).

Acceleration of reward-based learning by unsupervised learning (Fig. 7j) has been studied in the context of both semi-supervised learning and model-based reinforcement learning. In particular, the latter approach has been applied to rapid learning by animals, but these were limited to abstract models, not circuit-based implementations46. In the invertebrate literature, Bazhenov and colleagues (2013) studied the combination of unsupervised and reward-based learning in a computational model of the insect brain47, but plasticity was applied only to the output connections (corresponding in our model to \(\overline{p}\to {e}_{\mathrm{{{p}}}}\) in Fig. 7h). Interestingly, in the invertebrate brain, the connections corresponding to \(m\to \overline{p}\) are mostly random and fixed48, so the acceleration shown in Fig. 7j is potentially unique to vertebrates.

While our approach gave us a model that is reasonably consistent with mammalian olfactory circuitry, it is not perfect. In particular, the architecture predicted by our approximate Bayesian algorithm does not perfectly match the architecture of the olfactory bulb, piriform cortex, and olfactory tubercle. We were able to make small modifications to our circuit so that it did match the biology, and still gave decent performance, but performance was about 10% worse than that of the circuit predicted purely by Bayesian inference (blue lines in Fig. 7e, left vs. right). This discrepancy between the predicted and observed architectures highlights a limitation of this approach, especially when applied to complex systems. In particular, it is difficult to include biological constraints, both because we do not know exactly what they are, and because there is no straightforward way to marry those constraints with a normative Bayesian approach. Addressing this is an important avenue for future work.

Methods

Stimulus configuration

On each trial, the response of the ith glomerulus is modeled as

$${x}_{i}=\sum_{j}{w}_{ij}{c}_{j}+{\sigma }_{x}\xi_{i}$$
(6)

where cj is the concentration of odor j, and ξi is a zero mean, unit variance Gaussian random variable. The Gaussian assumption is justified because, although olfactory sensory neurons fire with approximately Poisson statistics, 1000–10,000 sensory neurons converge onto a single glomerulus22, where OSN activity is conveyed to M/T cells as stochastic currents. We take the affinities, or mixing weights, w, to be log-normal, followed by a normalization step

$${\mathrm{log}}\,{\widetilde{w}}_{ij} \sim {\mathcal{N}}(-{\mathrm{log}}\,({c}_{{\mathrm{{o}}}}M),1)$$
(7a)
$${w}_{ij}={\widetilde{w}}_{ij}\times \frac{\frac{1}{NM}\mathop{\sum }\nolimits_{i}^{N}\mathop{\sum }\nolimits_{j}^{M}{\widetilde{w}}_{ij}}{\frac{1}{M}\mathop{\sum }\nolimits_{j}^{M}{\widetilde{w}}_{ij}}$$
(7b)

where, recall, M is the number of odors and N is the number of glomeruli. The factor multiplying \({\widetilde{w}}_{ij}\) is 1 on average, so the normalization step does not have a huge effect on the weights. However, it forces ∑jwij to be strictly independent of i, which makes the learning process less noisy.
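As a sketch, the normalization of Eq. (7) can be written in a few lines; the assert verifies the property just described, that the row sum ∑jwij comes out the same for every glomerulus.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, c_o = 100, 400, 0.03

w_tilde = rng.lognormal(-np.log(c_o * M), 1.0, size=(N, M))   # Eq. (7a)
row_mean = w_tilde.mean(axis=1, keepdims=True)                # (1/M) sum_j w~_ij
w = w_tilde * (w_tilde.mean() / row_mean)                     # Eq. (7b)
assert np.allclose(w.sum(axis=1), w.sum(axis=1)[0])           # independent of i
```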

On each trial, odors cj (j = 1, 2, . . . , M) are generated from the spike-and-slab prior given as

$${p}_{{\mathrm{{c}}}}({c}_{j})=(1-{c}_{{\rm{o}}})\delta ({c}_{j})+{c}_{{\rm{o}}}\frac{{\alpha }^{\alpha }}{\Gamma (\alpha )}{c}_{j}^{\alpha -1}{{\mathrm{{e}}}}^{-\alpha {c}_{j}}\Theta ({c}_{j}),$$
(8)

where Θ(x) is the Heaviside step function. We used α = 3 everywhere except Supplementary Fig. 1, where we used α = 1. Under this prior, each odor is independently presented with probability co, and its amplitude follows a Gamma distribution with unit mean (Fig. 1, left). Note that the amplitude, cj, reflects log-concentration rather than concentration24. To avoid the null stimulus, we resampled the odors if all of the cj were 0 on any particular trial.
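A direct sampler for this prior, including the resampling of null trials, might look as follows (a Gamma distribution with shape α and rate α has unit mean):

```python
import numpy as np

rng = np.random.default_rng(0)
M, c_o, alpha = 100, 0.03, 3.0

def sample_odors():
    """Draw c from the spike-and-slab prior of Eq. (8), resampling if no
    odor is present (the null stimulus is excluded, as described above)."""
    while True:
        present = rng.random(M) < c_o
        if present.any():
            c = np.zeros(M)
            c[present] = rng.gamma(alpha, 1.0 / alpha, size=present.sum())
            return c
```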

Bayesian model

As discussed in the main text, we mainly focus on unsupervised learning, in which animals see only glomeruli activity and must make sense of it. This is essentially a clustering problem: if the same pattern of glomeruli activity occurs multiple times, the brain should recognize it as an odor. The activity patterns at the glomeruli are determined by the product of odorant concentrations in the inhaled air, and the affinities of the OSNs for those odorants. Thus, to recognize an odor, animals have to effectively learn the affinities of the OSNs for each odor, and store them in the olfactory circuitry. As we will see, in our model they are stored as weights between M/T cells and granule cells. Once those weights are stored, if an odor co-occurs with a reward (or punishment), the valence of that odor can be determined. And indeed, we find that unsupervised learning enables rapid learning of odor–reward associations.

More formally, the goal of the olfactory system is to infer the odor at time t, ct, given all past presentations of odors, x1:t ≡ {x1x2, . . . , xt}. Because the weights are not known, they must be integrated out

$$p({{\bf{c}}}_{t}| {{\bf{x}}}_{1:t})=\int\ {\mathrm{{d}}}{\bf{w}}\ p({{\bf{c}}}_{t},{\bf{w}}| {{\bf{x}}}_{1:t}).$$
(9)

Using Bayes’ theorem, this can be written in a more intuitive form

$$p({{\bf{c}}}_{t}| {{\bf{x}}}_{1:t})\propto \int\ {\mathrm{{d}}}{\bf{w}}\ p({{\bf{x}}}_{t}| {{\bf{c}}}_{t},{\bf{w}}){p}_{{\mathrm{{c}}}}({{\bf{c}}}_{t})p({\bf{w}}| {{\bf{x}}}_{1:t-1})$$
(10)

where, recall, pc(ct) is the prior over odors. To derive this expression, we used two facts: given ct and w, xt does not depend on past observations, and ct does not depend on past observations. The first term on the right-hand side, p(xt∣ct, w), is the likelihood given the weights; but because we do not know the weights, we have to marginalize over them given past observations. The marginalization step is intractable, as we have to introduce past odors and then integrate them out. This leaves us with an integral over w (Eq. (10)) that cannot be performed analytically. And even if it could, the circuit would have to memorize all past stimuli, x1, x2, . . . , xt−1. We thus have to perform approximate inference. For that we make a variational approximation.

Variational approximation

The integral in Eq. (10) becomes easier if the distributions factorize. We thus make the variational approximation

$$p({\bf{c}},{\bf{w}}| {{\bf{x}}}_{1:t-1},{\bf{x}})\approx {q}^{t}({\bf{w}},{\bf{c}})\equiv {\prod }_{ij}{q}_{ij}^{w,t}({w}_{ij})\times {\prod }_{j}{q}_{j}^{c}({c}_{j})$$
(11)

where, to avoid a proliferation of subscripts, we suppress the fact that c and \({q}_{j}^{c}\) are to be evaluated at trial t; in line with this, to simplify subsequent equations we replace xt with x; and, as is standard, we suppress the dependence of q on x1:t.

The variational distributions, \({q}_{ij}^{w,t}\) and \({q}_{j}^{c}\), are found by minimizing the KL-divergence with respect to the true distribution, with the KL-divergence given by

$${D}_{{\mathrm{{KL}}}}\left[{q}^{t}({\bf{w}},{\bf{c}})| | p({\bf{c}},{\bf{w}}| {{\bf{x}}}_{1:t-1},{\bf{x}})\right]=\int\ {\mathrm{{d}}}{\bf{c}}{\mathrm{{d}}}{\bf{w}}\ {q}^{t}({\bf{w}},{\bf{c}}){\mathrm{log}}\,\frac{{q}^{t}({\bf{w}},{\bf{c}})}{p({\bf{c}},{\bf{w}}| {{\bf{x}}}_{1:t-1},{\bf{x}})}\ .$$
(12)

As is straightforward to show9, minimizing this quantity leads to the update rules

$${\mathrm{log}}\,{q}_{ij}^{w,t}({w}_{ij}) \sim {\langle {\mathrm{log}}\,p({\bf{x}}| {\bf{c}},{\bf{w}})\rangle }_{\backslash {w}_{ij}}+{\langle {\mathrm{log}}\,p({\bf{w}}| {{\bf{x}}}_{1:t-1})\rangle }_{\backslash {w}_{ij}}$$
(13a)
$${\mathrm{log}}\,{q}_{j}^{c}({c}_{j}) \sim {\langle {\mathrm{log}}\,p({\bf{x}}| {\bf{c}},{\bf{w}})\rangle }_{\backslash {c}_{j}}+{\mathrm{log}}\,{p}_{c}({c}_{j})$$
(13b)

where  ~  indicates equality up to a constant, the subscript \wij indicates an average with respect to the variational distribution over all variables except wij, and, similarly, the subscript \cj indicates an average with respect to the variational distribution over all variables except cj. In the first equation, we approximate p(w∣x1:t−1) with the variational distribution at the previous time step, \({\prod }_{ij}{q}_{ij}^{w,t-1}({w}_{ij})\), which makes the marginalization self-consistent. This approximation breaks down early in the learning process; nevertheless, in practice it works quite well. Using this approximation, we arrive at

$${q}_{ij}^{w,t}({w}_{ij})\propto {q}_{ij}^{w,t-1}({w}_{ij})\exp \left[{\left\langle {\mathrm{log}}\,p({\bf{x}}| {\bf{c}},{\bf{w}})\right\rangle }_{\backslash {w}_{ij}}\right]$$
(14a)
$${q}_{j}^{c}({c}_{j})\propto {p}_{{\mathrm{{c}}}}({c}_{j})\exp \left[{\langle {\mathrm{log}}\,p({\bf{x}}| {\bf{c}},{\bf{w}})\rangle }_{\backslash {c}_{j}}\right]\ .$$
(14b)

In the next two subsections we derive explicit update rules by computing the averages in these expressions.

The variational odor distribution

To find the variational distribution over odors, we need to compute the average over \({\mathrm{log}}\,p({\bf{x}}| {\bf{c}},{\bf{w}})\) that appears on the right-hand side of Eq. (14b). Using the fact that x follows a Gaussian distribution, we have

$${\langle {\mathrm{log}}\,p({{\bf{x}}}_{t}| {\bf{c}},{\bf{w}})\rangle }_{\backslash {c}_{j}} \sim -\frac{1}{2{\sigma }_{x}^{2}}{\left\langle {\sum }_{i}{\left({x}_{i}^{t}-{\sum }_{m}{w}_{im}{c}_{m}\right)}^{2}\right\rangle }_{\backslash {c}_{j}}\\ \sim -\frac{{\sum }_{i}\langle {{w}_{ij}^{t}}^{2}\rangle }{2{\sigma }_{x}^{2}}{\left({c}_{j}-\frac{1}{{\sum }_{i}\langle {{w}_{ij}^{t}}^{2}\rangle }{\sum }_{i}\langle {w}_{ij}^{t}\rangle \left[{x}_{i}^{t}-{\sum }_{m\ne j}\langle {w}_{im}^{t}\rangle \langle {c}_{m}\rangle \right]\right)}^{2},$$
(15)

where the averages are with respect to the variational distribution. This is Gaussian, and it is straightforward to work out the mean and variance. Note that both depend on the first and second moments of the weights (which, as we will see below, determine the variational weight distribution) evaluated, importantly, at time t. However, synaptic plasticity is much slower than neural dynamics, so it is reasonable to update the weights on a slower timescale than concentration. Thus, when evaluating the mean and variance, we use the weight distribution on the previous time step. Using \({\mu }_{j}^{t}\) and \(1/{\lambda }_{j}^{t}\) to denote the mean and variance, and making this approximation, we have

$${\mu }_{j}^{t}\equiv \frac{1}{{\sum }_{i}\langle {{w}_{ij}^{t-1}}^{2}\rangle } \sum_{i}\langle {w}_{ij}^{t-1}\rangle \left[{m}_{i}^{t}+\langle {w}_{ij}^{t-1}\rangle \langle {c}_{j}\rangle \right]$$
(16a)
$${\lambda }_{j}^{t}\equiv \frac{1}{{\sigma }_{x}^{2}}\sum_{i}\langle {{w}_{ij}^{t-1}}^{2}\rangle$$
(16b)

where we made the definition

$${m}_{i}^{t}\equiv {x}_{i}^{t}-\sum_{j = 1}^{M}\langle {w}_{ij}^{t-1}\rangle \langle {c}_{j}\rangle \ .$$
(17)

The distribution \({q}_{j}^{c}({c}_{j})\) can now be written in a very compact form

$${q}_{j}^{{c}}({c}_{j})\propto {p}_{{\mathrm{{c}}}}({c}_{j})\ \exp \left[-\frac{{\lambda }_{j}^{t}}{2}{\left({c}_{j}-{\mu }_{j}^{t}\right)}^{2}\right].$$
(18)

As we will see below, to update the weights we just need the first and second moments of cj (see Eq. (27a)). And for reward-based learning, we need the probability that cj is positive. These quantities are straightforward, if tedious, to compute, and are given as follows.

For the first moment,

$$\langle {c}_{j}\rangle =\frac{1}{{Z}_{j}\sqrt{{\lambda }_{j}}}\left([2+{\alpha }_{j}^{2}]+{\alpha }_{j}[3+{\alpha }_{j}^{2}]\Psi ({\alpha }_{j})\right),$$
(19)

where the average is with respect to the distribution in Eq. (18), Zj is the normalization constant

$${Z}_{j}\equiv \frac{2(1-{c}_{{\rm{o}}})}{27{c}_{{\rm{o}}}}{\lambda }_{j}^{3/2}+{\alpha }_{j}+(1+{\alpha }_{j}^{2})\Psi ({\alpha }_{j}),$$
(20)

and αj and Ψ(αj) are defined by

$${\alpha }_{j}\equiv \sqrt{{\lambda }_{j}}{\mu }_{j}-\frac{3}{\sqrt{{\lambda }_{j}}}$$
(21a)
$$\Psi ({\alpha }_{j})\equiv \sqrt{2\pi }{e}^{{\alpha }_{j}^{2}/2}\Phi ({\alpha }_{j}),$$
(21b)

with Φ the cumulative normal function

$$\Phi (\alpha )\equiv \frac{1}{\sqrt{2\pi }}\int_{-\infty }^{\alpha }{{\mathrm{{e}}}}^{-{x}^{2}/2}{\mathrm{{d}}}x.$$
(22)

Similarly, the second moment is given by

$$\langle {c}_{j}^{2}\rangle =\frac{1}{{Z}_{j}{\lambda }_{j}}\left({\alpha }_{j}(5+{\alpha }_{j}^{2})+(3+6{\alpha }_{j}^{2}+{\alpha }_{j}^{4})\Psi ({\alpha }_{j})\right).$$
(23)

And finally, the probability that an odor is present is written

$$\Pr [{c}_{j}> 0]=\frac{1}{{Z}_{j}}\left({\alpha }_{j}+(1+{\alpha }_{j}^{2})\Psi ({\alpha }_{j})\right)\ .$$
(24)
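For reference, Eqs. (19)–(24) can be evaluated directly as below; μ and λ are the likelihood mean and precision from Eq. (16), and the numerical values passed in at the bottom are illustrative.

```python
from math import sqrt, pi, erf, exp

def posterior_stats(mu, lam, c_o=0.03):
    """<c>, <c^2>, and Pr[c > 0] under Eqs. (19)-(24) (alpha = 3 prior)."""
    a = sqrt(lam) * mu - 3.0 / sqrt(lam)                    # Eq. (21a)
    Phi = 0.5 * (1.0 + erf(a / sqrt(2.0)))                  # Eq. (22)
    Psi = sqrt(2.0 * pi) * exp(a * a / 2.0) * Phi           # Eq. (21b)
    Z = 2 * (1 - c_o) / (27 * c_o) * lam**1.5 + a + (1 + a * a) * Psi  # Eq. (20)
    c1 = ((2 + a * a) + a * (3 + a * a) * Psi) / (Z * sqrt(lam))       # Eq. (19)
    c2 = (a * (5 + a * a) + (3 + 6 * a * a + a**4) * Psi) / (Z * lam)  # Eq. (23)
    p_on = (a + (1 + a * a) * Psi) / Z                                 # Eq. (24)
    return c1, c2, p_on

print(posterior_stats(mu=1.0, lam=25.0))   # strong evidence: Pr[c > 0] near 1
print(posterior_stats(mu=0.1, lam=25.0))   # weak evidence: Pr[c > 0] near 0
```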

The variational weight distribution

To find the variational distribution over weights, we need to compute the average on the right-hand side of Eq. (14a). This is the same as Eq. (15), except that the average now excludes wij rather than cj,

$${\langle {\mathrm{log}}\,p({\bf{x}}| {\bf{c}},{\bf{w}})\rangle }_{\backslash {w}_{ij}} \sim -\frac{1}{2{\sigma }_{x}^{2}}{\left\langle {\left({x}_{i}-\sum_{m}{w}_{im}{c}_{m}\right)}^{2}\right\rangle }_{\backslash {w}_{ij}}\\ \sim -\frac{\langle {c}_{j}^{2}\rangle }{2{\sigma }_{x}^{2}}{\left({w}_{ij}-\frac{\langle {c}_{j}\rangle }{\langle {c}_{j}^{2}\rangle }\left[{x}_{i}-\sum_{m\ne j}\langle {w}_{im}^{t}\rangle \langle {c}_{m}\rangle \right]\right)}^{2}$$
(25)

where the averages are, as above, with respect to the variational distributions. This is a quadratic function of wij; thus, if we assume that \({q}_{ij}^{w,t-1}({w}_{ij})\) is Gaussian, then \({q}_{ij}^{w,t}({w}_{ij})\) is also Gaussian. Using \({\overline{w}}_{ij}^{t}\) and \(1/(t{\rho }_{j}^{t})\) to denote the mean and variance at time t, respectively (the latter to anticipate the 1/t falloff of the variance expected under Bayesian filtering), Eq. (14a) becomes

$$-\frac{t{\rho }_{j}^{t}}{2}{({w}_{ij}-{\overline{w}}_{ij}^{t})}^{2} \sim -\frac{(t-1){\rho }_{j}^{t-1}}{2}{\left({w}_{ij}-{\overline{w}}_{ij}^{t-1}\right)}^{2}-\frac{\langle {c}_{j}^{2}\rangle }{2{\sigma }_{x}^{2}}{\left({w}_{ij}-\frac{\langle {c}_{j}\rangle }{\langle {c}_{j}^{2}\rangle }\left[{x}_{i}-\sum_{m\ne j}{\overline{w}}_{im}^{t}\langle {c}_{m}\rangle \right]\right)}^{2}.$$
(26)

As in Eq. (15), \({\overline{w}}^{t}\) appears on the right-hand side of Eq. (26). However, solving this equation recursively for all the weights would require very fast synaptic plasticity. We thus approximate the right-hand side by using the previous timestep, t−1, rather than the current one, t; an approximation that should be good when the weights change slowly. Doing that, we arrive at the update rules

$${\rho }_{j}^{t}=(1-1/t){\rho }_{j}^{t-1}+\frac{1/t}{{\sigma }_{x}^{2}}\langle {c}_{j}^{2}\rangle$$
(27a)
$${\overline{w}}_{ij}^{t}=(1-1/t)\frac{{\rho }_{j}^{t-1}}{{\rho }_{j}^{t}}{\overline{w}}_{ij}^{t-1}+\frac{1/t}{{\rho }_{j}^{t}{\sigma }_{x}^{2}}\langle {c}_{j}\rangle \left({m}_{i}^{t}+{\overline{w}}_{ij}^{t-1}\langle {c}_{j}\rangle \right)$$
(27b)

where we used Eq. (17) to simplify the second expression. Note that the update rule for \({\overline{w}}_{ij}^{t}\) is local, as it depends only on variables indexed by i and j. The update rule for ρj is also local, and in fact depends only on variables indexed by j.

Finally, it is convenient to write the update rules for the mean and precision of the variational distribution over concentration, Eq. (16), in terms of \({\overline{w}}_{ij}\) and ρj,

$${\mu }_{j}^{t}\equiv \frac{1}{{\sigma }_{x}^{2}{\lambda }_{j}^{t}}\sum_{i}{\overline{w}}_{ij}^{t-1}\left[{m}_{i}^{t}+{\overline{w}}_{ij}^{t-1}\langle {c}_{j}^{t}\rangle \right]$$
(28a)
$${\lambda }_{j}^{t}\equiv \frac{1}{{\sigma }_{x}^{2}}\sum _{i}{\left({\overline{w}}_{ij}^{t-1}\right)}^{2}+\frac{N}{{\sigma }_{x}^{2}(t-1){\rho }_{j}^{t-1}}\ .$$
(28b)

As shown in Fig. 5c, the transfer function shifts to the right with learning. This seems counter-intuitive: because the weights become more certain with learning, it should take less input to the granule cells to produce activity; this suggests that the transfer functions should shift left, not right. However, an increase in certainty is not the only thing that changes with learning; the weights also become more diverse, capturing the diverse responses of glomeruli for each odor. The diversity increases the variance of the input to the granule cells, and so to ensure a sparse response with increasing diversity, the transfer functions need to shift to the right. In our model, increased diversity (the first term in Eq. (28b)) had a larger effect than increased certainty (the second term), resulting in a net rightward shift in the transfer functions.

Network model

The analysis in the previous sections revealed that, under the variational approximation, the distributions of the odors and the weights are updated locally. Thus, we implement the update rules in a network model of the olfactory bulb. The update of the weight distribution, \({q}_{ij}^{w,t}({w}_{ij})\), depends on 〈cj〉 and \(\langle {c}_{j}^{2}\rangle\), as shown in Eq. (27), while the update of the odor distribution, \({q}_{j}^{c,t}({c}_{j})\), depends on \({\overline{w}}_{ij}\) and ρj, as shown in Eq. (28). Ideally, all these parameters should be updated simultaneously. However, as mentioned above, updates to synaptic weights are typically much slower than the neural dynamics, so here we consider a two-step update. First, the relevant parameters of the variational odor distribution, 〈cj〉 and \(\langle {c}_{j}^{2}\rangle\), are updated using the mean and precision of the weight distribution, \({\overline{w}}_{ij}\) and ρj, evaluated at t−1. Then, \({\overline{w}}_{ij}\) and ρj are updated using the first and second moments of the concentrations, 〈cj〉 and \(\langle {c}_{j}^{2}\rangle\), evaluated at time t.

Neural dynamics

Our goal is to write down a set of dynamical equations for 〈cj〉 and \(\langle {c}_{j}^{2}\rangle\) whose fixed points correspond to the values given in Eqs. (19) and (23), respectively. Examining these equations, we see that 〈cj〉 and \(\langle {c}_{j}^{2}\rangle\) depend on αj and λj; after a small amount of algebra (involving the insertion of Eq. (28a) into Eq. (21a)), αj may be written

$${\alpha }_{j}=\frac{1}{\sqrt{{\lambda }_{j}}{\sigma }_{x}^{2}}\left(\sum_{i}{\overline{w}}_{ij}{m}_{i}+\sum_{i}{\overline{w}}_{ij}^{2}\langle {c}_{j}\rangle -3{\sigma }_{x}^{2}\right)\ .$$
(29)

To avoid clutter, we dropped the dependence on time, but the weights should be evaluated at time t−1 and all other variables at time t.

Because neither αj nor λj (the latter given in Eq. (28b)) depends on \(\langle {c}_{j}^{2}\rangle\), we can write down coupled equations for 〈cj〉 and mi; the solution of those equations gives us the values of αj and λj, which in turn give us, via Eq. (23), \(\langle {c}_{j}^{2}\rangle\). Using, for notational ease, \({\overline{c}}_{j}\) rather than 〈cj〉, the simplest such equations (derived from Eqs. (17) and (19)) are

$${\tau }_{{\mathrm{{r}}}}\frac{{\mathrm{{d}}}{m}_{i}}{{\mathrm{{d}}}\tau }={x}_{i}-{m}_{i}-\sum_{j = 1}^{M}{w}_{ij}^{{\rm{L}}}{\overline{c}}_{j}$$
(30)
$${\tau }_{{\mathrm{{r}}}}\frac{{\mathrm{{d}}}{\overline{c}}_{j}}{{\mathrm{{d}}}\tau }=-{\overline{c}}_{j}+{F}_{j}\left[\sum_{i = 1}^{N}{w}_{ji}^{{\rm{F}}}{m}_{i};{\overline{c}}_{j}\right]$$
(31)

where τr is the time constant of the firing rate dynamics, and the nonlinear transfer function, F, is given by the right-hand side of Eq. (19)

$${F}_{j}\left[\sum_{i = 1}^{N}{w}_{ji}^{{\rm{F}}}{m}_{i};{\overline{c}}_{j}\right]\equiv \frac{1}{\sqrt{{\lambda }_{j}}}\frac{(2+{\alpha }_{j}^{2})+{\alpha }_{j}(3+{\alpha }_{j}^{2})\Psi ({\alpha }_{j})}{\frac{2(1-{c}_{{\rm{o}}})}{27{c}_{{\rm{o}}}}{\lambda }_{j}^{3/2}+{\alpha }_{j}+(1+{\alpha }_{j}^{2})\Psi ({\alpha }_{j})}$$
(32)

with αj given in Eq. (29) and λj in Eq. (28b). Note that we have replaced the average weights, \({\overline{w}}_{ij}\), with two different weights, \({w}_{ij}^{{\rm{L}}}\) and \({w}_{ji}^{{\rm{F}}}\). Ideally, we should have \({w}_{ji}^{{\rm{F}}}={w}_{ij}^{{\rm{L}}}={\overline{w}}_{ij}\), but, for biological plausibility, we allow these reciprocal synapses to be learned independently. Note that when evaluating αj, Eq. (29), \({w}_{ji}^{{\rm{F}}}\) should be used. Although the expression for Fj seems complicated, the transfer functions are relatively smooth, and resemble experimentally observed ones (see Fig. 5).

As shown in Fig. 3b, this dynamical system resembles the neural dynamics of the olfactory bulb, under the assumption that mi and \({\overline{c}}_{j}\) are the firing rates of M/T cells and the granule cells, respectively. With this assumption, \({w}_{ji}^{{\rm{F}}}\) is the connection from M/T cell i to granule cell j and \({w}_{ij}^{{\rm{L}}}\) is the connection from granule cell j to M/T cell i.

Finally, the second moment of the concentration is given, via Eq. (23), by

$$\langle {c}_{j}^{2}\rangle ={G}_{j}\left[\sum_{i}^{N}{w}_{ji}^{{\mathrm{{F}}},t-1}{m}_{i};{\overline{c}}_{j}\right]\equiv \frac{1}{{\lambda }_{j}}\frac{{\alpha }_{j}(5+{\alpha }_{j}^{2})+(3+6{\alpha }_{j}^{2}+{\alpha }_{j}^{4})\Psi ({\alpha }_{j})}{\frac{2(1-{c}_{{\rm{o}}})}{27{c}_{{\rm{o}}}}{\lambda }_{j}^{3/2}+{\alpha }_{j}+(1+{\alpha }_{j}^{2})\Psi ({\alpha }_{j})}\ .$$
(33)
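For concreteness, the following NumPy sketch integrates Eqs. (30) and (31) with Euler steps and evaluates the second moment, Eq. (33), at the end of the trial. The weights and the input are random placeholders (all sizes and constants are illustrative), and Ψ is computed with a clipped argument in the spirit of the stabilization introduced later in Eq. (61).

```python
import numpy as np
from scipy.special import ndtr  # Phi, the standard normal CDF

def psi(alpha):
    # Psi(alpha) = sqrt(2*pi) * Phi(alpha) * exp(alpha**2 / 2); the argument
    # is clipped to the range where this is numerically safe (cf. Eq. (61))
    a = np.clip(alpha, -10 * np.sqrt(2), 10 * np.sqrt(2))
    return np.sqrt(2 * np.pi) * ndtr(a) * np.exp(a**2 / 2)

def alpha_of(wF, m, c_bar, lam, sigma_x):
    # Eq. (29), with the feedforward weights w^F in place of w-bar
    drive = wF @ m + (wF**2).sum(axis=1) * c_bar - 3 * sigma_x**2
    return drive / (np.sqrt(lam) * sigma_x**2)

def F_of(alpha, lam, c_o):
    # granule-cell transfer function, Eq. (32)
    P, K = psi(alpha), 2 * (1 - c_o) / (27 * c_o)
    num = (2 + alpha**2) + alpha * (3 + alpha**2) * P
    den = K * lam**1.5 + alpha + (1 + alpha**2) * P
    return num / (np.sqrt(lam) * den)

def G_of(alpha, lam, c_o):
    # second moment of the concentration, Eq. (33)
    P, K = psi(alpha), 2 * (1 - c_o) / (27 * c_o)
    num = alpha * (5 + alpha**2) + (3 + 6 * alpha**2 + alpha**4) * P
    den = K * lam**1.5 + alpha + (1 + alpha**2) * P
    return num / (lam * den)

# one trial of the neural dynamics, Eqs. (30)-(31), by Euler integration
rng = np.random.default_rng(1)
N, M, sigma_x, c_o, t = 40, 10, 0.1, 0.1, 10
tau_r, dt = 1.0, 0.05
wF = rng.lognormal(-2.0, 0.1, size=(M, N))    # granule j <- M/T i
wL = wF.T.copy()                              # ideally w^L = (w^F)^T
rho = np.ones(M)                              # precision factors

c_true = np.zeros(M); c_true[3] = 1.0         # a single odor present
x = wF.T @ c_true + sigma_x * rng.normal(size=N)

lam = (wF**2).sum(axis=1) / sigma_x**2 \
      + N / (sigma_x**2 * (t - 1) * rho)      # Eq. (28b)
m, c_bar = np.zeros(N), np.full(M, c_o)
for _ in range(2000):
    alpha = alpha_of(wF, m, c_bar, lam, sigma_x)
    m += dt / tau_r * (x - m - wL @ c_bar)                  # Eq. (30)
    c_bar += dt / tau_r * (-c_bar + F_of(alpha, lam, c_o))  # Eq. (31)
    c_bar = np.maximum(c_bar, 0.0)   # rates are lower-bounded by zero
c_sq = G_of(alpha, lam, c_o)         # Eq. (33), used by the plasticity step
```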

Synaptic plasticity

After trial t, the average feedforward weights, \({w}_{ji}^{{\rm{F}}}\), and the average lateral weights, \({w}_{ij}^{{\rm{L}}}\), are updated as in Eq. (27b)

$${w}_{ji}^{{\mathrm{{F}}},t}=\left(1-{\delta }_{j}^{w,t}\right){w}_{ji}^{{\mathrm{{F}}},t-1}+\frac{1/t}{{\rho }_{j}^{t}{\sigma }_{x}^{2}}{\overline{c}}_{j}{m}_{i}$$
(34a)
$${w}_{ij}^{{\mathrm{{L}}},t}=\left(1-{\delta }_{j}^{w,t}\right){w}_{ij}^{{\mathrm{{L}}},t-1}+\frac{1/t}{{\rho }_{j}^{t}{\sigma }_{x}^{2}}{m}_{i}{\overline{c}}_{j}$$
(34b)
$${\delta }_{j}^{w,t}\equiv \frac{1}{t}+\left(1-\frac{1}{t}\right)\left(1-\frac{{\rho }_{j}^{t-1}}{{\rho }_{j}^{t}}\right)-\frac{{\overline{c}}_{j}^{2}}{t{\rho }_{j}^{t}{\sigma }_{x}^{2}}\ .$$
(34c)

We used the firing rates mi and \({\overline{c}}_{j}\) at the end of the trial, after the neural dynamics has reached steady state. As the weight updates depend primarily on the product of mi and \({\overline{c}}_{j}\), the learning rules are essentially Hebbian. Note that if the initial conditions are the same (i.e., if \({w}_{ji}^{{\mathrm{{F}}},0}={w}_{ij}^{{\mathrm{{L}}},0}\)), then \({w}_{ji}^{{\mathrm{{F}}},t}\) and \({w}_{ij}^{{\mathrm{{L}}},t}\) will remain the same for all time. This is reasonable given that connections between M/T cells and granule cells are dendro-dendritic.

The variance of the weights, \(1/(t{\rho }_{j}^{t})\), consists of two factors. The first, 1/t, represents the global hyperbolic decay of the learning rate due to the accumulation of information. In our simulations, we started t from \(t={t}_{\min }\) to suppress the influence of the initial samples; this is equivalent to using a trial-dependent discount factor \(1/(t+{t}_{\min })\) instead of 1/t, where t is the actual trial count. The second, \({\rho }_{j}^{t}\), represents the neuron-specific contribution to the precision, and is given, via Eqs. (27) and (23), by

$${\rho }_{j}^{t}=(1-1/t){\rho }_{j}^{t-1}+\frac{1}{t{\sigma }_{x}^{2}}{G}_{j}\left[\sum_{i}^{N}{w}_{ji}^{{\mathrm{{F}}},t-1}{m}_{i};{\overline{c}}_{j}\right]\ ,$$
(35)

where Gj, the second moment of the concentration, is given in Eq. (33).
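Putting Eqs. (34) and (35) together, one post-trial plasticity step can be sketched as follows. The steady-state rates m and \({\overline{c}}_{j}\), and the second moment (the output of Gj), are placeholders standing in for the values produced by the neural dynamics above; sizes and constants are illustrative.

```python
import numpy as np

N, M, sigma_x, t = 40, 10, 0.1, 10
rng = np.random.default_rng(2)
wF = rng.lognormal(-2.0, 0.1, size=(M, N))    # granule j <- M/T i
wL = wF.T.copy()                              # M/T i <- granule j
rho = np.ones(M)
m = rng.normal(0.0, 0.1, size=N)              # steady-state M/T rates (placeholder)
c_bar = rng.random(M) * 0.2                   # steady-state granule rates (placeholder)
c_sq = c_bar**2 + 0.01                        # placeholder for G_j[...]

rho_new = (1 - 1 / t) * rho + c_sq / (t * sigma_x**2)       # Eq. (35)

delta = (1 / t + (1 - 1 / t) * (1 - rho / rho_new)
         - c_bar**2 / (t * rho_new * sigma_x**2))           # Eq. (34c)

# Hebbian updates, Eqs. (34a)-(34b): both are driven by the product m_i * c_bar_j
wF = (1 - delta)[:, None] * wF + np.outer(c_bar, m) / (t * rho_new[:, None] * sigma_x**2)
wL = (1 - delta)[None, :] * wL + np.outer(m, c_bar) / (t * rho_new[None, :] * sigma_x**2)
wF, wL = np.maximum(wF, 0.0), np.maximum(wL, 0.0)  # weights lower-bounded by zero
rho = rho_new
```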

Models with various priors on odor concentration

In our model setting, the prior over concentration, pc(c), enters via Eq. (14b), and affects the transfer functions F and G, given in Eqs. (32) and (33), respectively. Choosing different priors gives different transfer functions. Below we consider two common ones: non-negative, and non-negative with an exponential decay.

The first of these is actually an improper prior, pc(c) ∝ Θ(c). This results in gain functions of the form

$$F[{\mu }_{j};{\lambda }_{j}]={\mu }_{j}+\frac{1}{\sqrt{{\lambda }_{j}}\Psi \left[\sqrt{{\lambda }_{j}}\mu_{j} \right]}$$
(36a)
$$G[{\mu }_{j};{\lambda }_{j}]={\mu }_{j}F[{\mu }_{j};{\lambda }_{j}]+\frac{1}{{\lambda }_{j}}$$
(36b)

where μj and λj are given in Eqs. (28a) and (28b), respectively.

Under the non-negative prior introduced above, all positive concentrations are equally likely. However, that is not the case in a typical environment. Far more realistic is to assume that large concentrations are exponentially unlikely, yielding a prior of the form \({p}_{{\mathrm{{c}}}}(c)=\frac{1}{{c}_{{\rm{o}}}}\exp \left(-c/{c}_{{\rm{o}}}\right)\). (The decay constant, co, was chosen so that the mean is equal to co, the same mean as in the true generative model.) For this prior, the functions F and G are

$$F[{\mu }_{j};{\lambda }_{j}]=\left({\mu }_{j}-\frac{1}{{c}_{{\rm{o}}}{\lambda }_{j}}\right)+\frac{1}{\sqrt{{\lambda }_{j}}\Psi \left[\sqrt{{\lambda }_{j}}{\mu }_{j}-\frac{1}{{c}_{{\rm{o}}}\sqrt{{\lambda }_{j}}}\right]}$$
(37a)
$$G[{\mu }_{j};{\lambda }_{j}]=\left({\mu }_{j}-\frac{1}{{c}_{{\rm{o}}}{\lambda }_{j}}\right)F[{\mu }_{j};{\lambda }_{j}]+\frac{1}{{\lambda }_{j}}\ .$$
(37b)

While this prior is suboptimal for olfactory learning, experimental results from visual cortex indicate that the transfer function there resembles the one in Eq. (37a)49 (black curve in Fig. 5a). Indeed, in early visual regions, where the prior is arguably more continuous10, this shifted rectified-linear transfer function might be more beneficial50.
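The two pairs of gain functions, Eqs. (36) and (37), are straightforward to evaluate; a minimal NumPy sketch is given below, with Ψ computed from the standard normal CDF and a clipped argument for stability (cf. Eq. (61)). The values of λ and co in the demo lines are illustrative.

```python
import numpy as np
from scipy.special import ndtr  # Phi, the standard normal CDF

def psi(alpha):
    # Psi(alpha) = sqrt(2*pi) * Phi(alpha) * exp(alpha**2 / 2), clipped for
    # stability (cf. the piecewise treatment in Eq. (61))
    a = np.clip(alpha, -10 * np.sqrt(2), 10 * np.sqrt(2))
    return np.sqrt(2 * np.pi) * ndtr(a) * np.exp(a**2 / 2)

def F_flat(mu, lam):
    # Eq. (36a): improper non-negative prior, p(c) ~ Theta(c)
    return mu + 1.0 / (np.sqrt(lam) * psi(np.sqrt(lam) * mu))

def G_flat(mu, lam):
    # Eq. (36b)
    return mu * F_flat(mu, lam) + 1.0 / lam

def F_exp(mu, lam, c_o):
    # Eq. (37a): exponential prior with mean c_o; the argument is shifted
    # left by 1/(c_o*lam), so the curve shifts right
    s = mu - 1.0 / (c_o * lam)
    return s + 1.0 / (np.sqrt(lam) * psi(np.sqrt(lam) * s))

def G_exp(mu, lam, c_o):
    # Eq. (37b)
    return (mu - 1.0 / (c_o * lam)) * F_exp(mu, lam, c_o) + 1.0 / lam

# both behave as smoothed rectifiers of mu
mu = np.linspace(-0.5, 1.0, 7)
print(F_flat(mu, 100.0))
print(F_exp(mu, 100.0, 0.1))
```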

Learning concentration-invariant representations

Up to now we focused on the expected concentration, \({\overline{c}}_{j}\). However, in natural environments animals often care more about whether an odor is present in their vicinity than about its concentration. From a Bayesian perspective, this means the animals should compute the probability that an odor is present, denoted \({\overline{p}}_{j}\). Using Eq. (24), \({\overline{p}}_{j}\) can be estimated as the steady state of the following dynamics:

$${\tau }_{{\mathrm{{r}}}}\frac{{\mathrm{{d}}}{\overline{p}}_{j}}{{\mathrm{{d}}}\tau }=-{\overline{p}}_{j}+{H}_{j}\left[\sum_{i}{w}_{ji}^{{\rm{F}}}{m}_{i},{\overline{c}}_{j}\right]$$
(38)

where Hj, which is approximately sigmoidal, is given, via Eq. (24), by

$${H}_{j}\left[\sum_{i}{w}_{ji}^{{\rm{F}}}{m}_{i},{\overline{c}}_{j}\right]=\frac{{\alpha }_{j}+(1+{\alpha }_{j}^{2})\Psi ({\alpha }_{j})}{\frac{2(1-{c}_{{\rm{o}}})}{27{c}_{{\rm{o}}}}{\lambda }_{j}^{3/2}+{\alpha }_{j}+(1+{\alpha }_{j}^{2})\Psi ({\alpha }_{j})}$$
(39)

with αj given in Eq. (29), but with \({\overline{w}}_{ij}\) replaced by \({w}_{ji}^{{\mathrm{{F}}}}\) in that equation as before.

In principle, neurons receiving input, mi, from M/T cells, such as layer 2 piriform cortex neurons, can decode the odor probability, as shown in Fig. 7a and 7e-left. However, to calculate Hj given input from M/T cells, the neuron would need to know the weights, \({w}_{ij}^{{\rm{F}}}\), as well as λj and \({\overline{c}}_{j}\) (the latter because αj depends on \({\overline{c}}_{j}\); see Eq. (29)). This is clearly unrealistic, because there is no known biological mechanism that enables copying weights. Moreover, because granule cells do not have output projections, except for the dendro-dendritic connections with M/T cells, piriform neurons cannot know \({\overline{c}}_{j}\) directly. Nevertheless, piriform neurons can learn to decode the concentration-invariant representation, \({\overline{p}}_{j}\), as follows.

Let us use \({w}_{ji}^{{\mathrm{{p}}}}\) to denote the mean weight from M/T cells to the piriform neurons (see Fig. 7b–d). Assume for the moment that \({w}_{ji}^{{\mathrm{{p}}}}\approx {w}_{ji}^{{\rm{F}}}\); shortly we will write down a learning rule that achieves this (see Eq. (43)). This takes care of the weights, but we also need an approximation to \({\overline{c}}_{j}\). For that, we notice that if the estimation is unbiased, on average both \({\overline{c}}_{j}\) and \({\overline{p}}_{j}\) are equal to co. Thus, the simplest way to approximate \({\overline{c}}_{j}\) with the information available to the jth piriform neuron is to use \({\overline{c}}_{j}\approx {\overline{p}}_{j}\). Under this approximation, and using \({w}_{ji}^{{\mathrm{{p}}}}\) in place of \({w}_{ji}^{{\rm{F}}}\), Eq. (38) becomes

$${\tau }_{{\mathrm{{r}}}}\frac{{\mathrm{{d}}}{\overline{p}}_{j}}{{\mathrm{{d}}}\tau }=-{\overline{p}}_{j}+{H}_{j}\left[\sum_{i}{w}_{ji}^{{\mathrm{{p}}}}{m}_{i},{\overline{p}}_{j}\right]$$
(40)

where Hj is the same as Eq. (39), but with αj replaced by \({\alpha }_{j}^{p}\), the analog of αj computed with \({w}_{ji}^{{\mathrm{{p}}}}\) and with lateral inhibition,

$${\alpha }_{j}^{p}\equiv \frac{1}{\sqrt{{\lambda }_{j}^{p}}{\sigma }_{x}^{2}}\left(\sum_{i}{w}_{ji}^{{\mathrm{{p}}}}{m}_{i}+\sum_{i}{\left({w}_{ji}^{{\mathrm{{p}}}}\right)}^{2}{\overline{p}}_{j}-{\sigma }_{x}^{2}\left[3+{\lambda }_{j}^{p}\sum_{k\ne j}{J}_{jk}{\overline{p}}_{k}\right]\right)$$
(41)

where, by analogy with Eq. (28b), \({\lambda }_{j}^{p}\) is given by

$${\lambda }_{j}^{p}\equiv \frac{1}{{\sigma }_{x}^{2}}\sum_{i = 1}^{N}{\left({w}_{ji}^{{\mathrm{{p}}}}\right)}^{2}+\frac{N}{{\sigma }_{x}^{2}(t-1){\rho }_{j}^{p,t-1}}.$$
(42)

As above, \({\overline{p}}_{j}\) evolves with the weights set to the values updated at the end of the previous trial. Once the neural dynamics reaches steady state, the weights are updated as in Eq. (34)

$${w}_{ji}^{{\mathrm{{p}}},t}=\left(1-{\delta }_{j}^{p,t}\right){w}_{ji}^{{\mathrm{{p}}},t-1}+\frac{1/t}{{\rho }_{j}^{p,t}{\sigma }_{x}^{2}}{F}_{j}\left[\sum_{i}^{N}{w}_{ji}^{{\mathrm{{p}}},t-1}{m}_{i},{\overline{p}}_{j}\right]{m}_{i}$$
(43a)
$${\delta }_{j}^{p,t}\equiv \frac{1}{t}+\left(1-\frac{1}{t}\right)\left(1-\frac{{\rho }_{j}^{p,t-1}}{{\rho }_{j}^{p,t}}\right)-\frac{1}{t{\rho }_{j}^{p,t}{\sigma }_{x}^{2}}{\left({F}_{j}\left[\sum_{i}^{N}{w}_{ji}^{p,t-1}{m}_{i},{\overline{p}}_{j}\right]\right)}^{2}$$
(43b)

and the precision as in Eq. (27a)

$${\rho }_{j}^{p,t}=(1-1/t){\rho }_{j}^{p,t-1}+\frac{1/t}{{\sigma }_{x}^{2}}{G}_{j}\left[\sum_{i}^{N}{w}_{ji}^{p,t-1}{m}_{i},{\overline{p}}_{j}\right].$$
(44)

Here Fj and Gj are the estimated first and second moments given in Eqs. (32) and (33), respectively, but calculated with \({\alpha }_{j}^{p}\) from Eq. (41). In steady state, these two terms approximate \({\overline{c}}_{j}\) and \(\langle {c}_{j}^{2}\rangle\). In addition, to ensure sparse piriform cell firing51, we introduced Hebbian plasticity in the lateral weights Jjk,

$$\Delta {J}_{jk}=0.1{\overline{p}}_{k}\left(-5{c}_{{\rm{o}}}{J}_{jk}+{\overline{p}}_{j}\right)\ ,$$
(45)

while bounding Jjk ≥ 0 and enforcing Jjj = 0. We initialized the lateral weights to Jjk = 0.02; a minimal sketch of this update follows.
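The sketch below applies Eq. (45) with the stated bound and diagonal constraint; the piriform activity \({\overline{p}}_{j}\) is a random placeholder and the sizes are illustrative.

```python
import numpy as np

M, c_o = 10, 0.1
rng = np.random.default_rng(3)
J = 0.02 * (1 - np.eye(M))        # initialization used in the simulations
p_bar = rng.random(M)             # steady-state piriform activity (placeholder)

dJ = 0.1 * p_bar[None, :] * (-5 * c_o * J + p_bar[:, None])   # Eq. (45)
J = np.maximum(J + dJ, 0.0)       # bound J_jk >= 0
np.fill_diagonal(J, 0.0)          # enforce J_jj = 0
```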

In Fig. 7e (panel d), 7f (orange line), 7g, and 7j–k, we modified the transfer function Fj of granule cells by replacing the prior term co with the input from piriform neuron \({\overline{p}}_{j}\). This means that \({F}_{j}^{{\rm{D}}}\) is written as

$${F}_{j}^{{\rm{D}}}\left[\sum_{i = 1}^{N}{w}_{ji}^{{\rm{F}}}{m}_{i};{\overline{c}}_{j},{\overline{p}}_{j}\right]\equiv \frac{1}{\sqrt{{\lambda }_{j}}}\frac{(2+{\alpha }_{j}^{2})+{\alpha }_{j}(3+{\alpha }_{j}^{2})\Psi ({\alpha }_{j})}{\frac{2(1-{\overline{p}}_{j})}{27{\overline{p}}_{j}}{\left({\lambda }_{j}\right)}^{3/2}+{\alpha }_{j}+(1+{\alpha }_{j}^{2})\Psi ({\alpha }_{j})}$$
(46)

where αj is still given by Eq. (29). We modulated the gain function Gj of granule cells, Eq. (33), in the same way, by replacing co with \({\overline{p}}_{j}\). In Fig. 7f, we changed the relative strength of lateral inhibition by replacing Jjk in Eq. (41) with κJJjk, where κJ, the relative strength, ranged from 0 to 3, as shown on the x-axis of Fig. 7f, while using the original Jjk for the weight update.

Reward-based learning

Assuming that the reward amplitude depends only on the identity of the odors, not on their concentrations, the reward, R, on trial t is given by

$$R=\sum_{j = 1}^{M}{a}_{j}\Theta ({c}_{j})+{\sigma }_{\zeta }{\zeta }_{t}$$
(47)

where ζt is a zero-mean, unit-variance Gaussian random variable, and Θ(x) is the Heaviside step function.

To estimate the reward, we augment the circuit in Fig. 7d by introducing a set of neurons, denoted ep, that receive input both from \({\overline{p}}_{j}\) and from the reward, R (see Fig. 7h). Using \({\overline{a}}_{j}\) to denote the weights from the piriform neurons to ep, the natural neural dynamics of ep is

$${\tau }_{{{r}}}\frac{{\mathrm{{d}}}{e}_{{{p}}}}{{\mathrm{d}}\tau }=-{e}_{{{p}}}+{\widehat{R}}_{t}(\tau )-\sum_{j}{\overline{a}}_{j}{\overline{p}}_{j}.$$
(48)

To represent the delay in reward delivery, \({\widehat{R}}_{t}(\tau )\) is zero for the first 2.5 s; after that it is set to the value of the reward,

$${\widehat{R}}_{t}(\tau )=\left\{\begin{array}{ll}0&\,\,\tau <2.5\,{\rm{{s}}} \\ R&\,\,\tau \ge 2.5\,{\rm{{s}}}.\end{array}\right.$$
(49)

Note that for the first 2.5 s of the trial, −ep carries a prediction of the upcoming reward based on the olfactory input, x. Once the reward is provided, the neuron represents the difference between the expected and the actual reward. That difference can be used to drive learning via Hebbian plasticity,

$${\overline{a}}_{j}^{t}={\overline{a}}_{j}^{t-1}+{\eta }_{a}{e}_{{\mathrm{{p}}}}{\overline{p}}_{j}$$
(50)

where \({\overline{a}}_{j}\) is updated only after the reward has been presented. Importantly, ep is evaluated after the reward presentation.
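The following sketch runs one trial of this reward-prediction step, Eqs. (48)–(50), with the piriform activity held fixed at its steady-state value; the trial duration, time step, and activity values are illustrative placeholders.

```python
import numpy as np

M, tau_r, dt, eta_a = 10, 0.2, 0.001, 0.5
rng = np.random.default_rng(4)
a_bar = np.zeros(M)                   # reward weights, initialized to zero
p_bar = rng.random(M)                 # steady-state piriform activity (placeholder)
R = 1.0 + 0.1 * rng.normal()          # noisy reward, Eq. (47)

e_p = 0.0
for step in range(int(5.0 / dt)):     # a 5 s trial (illustrative duration)
    tau = step * dt
    R_hat = R if tau >= 2.5 else 0.0  # delayed reward delivery, Eq. (49)
    e_p += dt / tau_r * (-e_p + R_hat - a_bar @ p_bar)   # Eq. (48)

# before 2.5 s, -e_p predicts the reward; afterwards e_p is the prediction
# error, which drives the Hebbian update of Eq. (50)
a_bar += eta_a * e_p * p_bar
```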

Similarly, for the direct readout from x depicted in Fig. 7i, the reward is predicted by

$${\tau }_{{\mathrm{{r}}}}\frac{{\mathrm{{d}}}{e}_{x}}{{\mathrm{{d}}}\tau }=-{e}_{x}+{\widehat{R}}_{t}-\sum_{i}{h}_{i}{x}_{i}\ ,$$
(51)

with hi again updated via Hebbian plasticity,

$${h}_{i}^{t}={h}_{i}^{t-1}+{\eta }_{h}{e}_{x}{x}_{i}\ ,$$
(52)

after the reward has been presented.

Sparse coding

The sparse coding model originally proposed by Olshausen and colleagues10,52 can be applied to the model of olfactory learning as shown below. The basic idea is that the odor, denoted \(\widehat{{\bf{c}}}\), and the weight matrix, denoted \(\widehat{{\bf{w}}}\), that best explain the input, x, should be close to the real c and w. This means \(\widehat{{\bf{c}}}\) and \(\widehat{{\bf{w}}}\) can be estimated by performing stochastic gradient ascent on the likelihood of the inputs, x.

However, this is sub-optimal, primarily because the uncertainties in \(\widehat{{\bf{c}}}\) and \(\widehat{{\bf{w}}}\) are ignored, even though they are important for data-efficient learning45. In addition, for tractability, the prior over the odors is taken to be a continuous function, making it difficult to capture the fact that at any given time most odors are absent. These constraints make the learning algorithm inefficient.

The log likelihood of the data with respect to an unknown set of weights, denoted \(\widehat{{\bf{w}}}\), is given by

$${\mathrm{log}}\,p({{\bf{x}}}_{t}| \widehat{{\bf{w}}}) = \, {\mathrm{log}}\,\left(\int\ p({{\bf{x}}}_{t}| {{\bf{c}}}_{t},\widehat{{\bf{w}}})p({{\bf{c}}}_{t}){\mathrm{{d}}}{{\bf{c}}}_{t}\right)\\ \approx {\mathrm{log}}\,\left(p({{\bf{x}}}_{t}| {\widehat{{\bf{c}}}}_{t},\widehat{{\bf{w}}})p({\widehat{{\bf{c}}}}_{t})\right)+{\rm{const}}.$$
(53)

In the second line, the integral was approximated with the maximum a posteriori estimate \({\widehat{{\bf{c}}}}_{t}=\arg \,\mathop{\max }\limits_{{\bf{c}}}p({{\bf{x}}}_{t}| {\bf{c}},\widehat{{\bf{w}}})p({\bf{c}})\). The objective function is thus given by

$${E}_{t}\equiv {\mathrm{log}}\,p({{\bf{x}}}_{t}| {\widehat{{\bf{c}}}}_{t},\widehat{{\bf{w}}})+{\mathrm{log}}\,p({\widehat{{\bf{c}}}}_{t}).$$
(54)

Because the noise on xt is Gaussian (see Eq. (6)), the first term is a simple quadratic function. However, the second term, \({\mathrm{log}}\,p({\widehat{{\bf{c}}}}_{t})\), requires further approximation to remove the delta function, and thus ensure differentiability of Et with respect to \({\hat{c}}_{j}\). To this end, we approximated the prior with a Gamma distribution: \({p}_{{\mathrm{{c}}}}({\hat{c}}_{j})\propto {\hat{c}}_{j}^{{k}_{c}-1}{}{{\mathrm{{e}}}}^{-{\hat{c}}_{j}/{\theta }_{c}}\), for which the mean is kcθc. We used kc = 3 and θc = co/3, ensuring a mean of co. Under this approximation, the objective function, Et, becomes

$${E}_{t}=\frac{-1}{2{\sigma }_{x}^{2}}{\sum }_{i}{\left({x}_{i}^{t}-\sum _{j}{\widehat{w}}_{ij}{\hat{c}}_{j}^{t}\right)}^{2}+\sum_{j}\left(({k}_{c}-1){\mathrm{log}}\,{\hat{c}}_{j}^{t}-{\hat{c}}_{j}^{t}/{\theta }_{c}\right).$$
(55)

We maximize the objective function via stochastic gradient ascent, which occurs in two steps. In the first step, we maximize Et with respect to \(\widehat{{\bf{c}}}\),

$$\Delta {\hat{c}}_{j}\propto \frac{\partial {E}_{t}}{\partial {\hat{c}}_{j}}=\frac{1}{{\sigma }_{x}^{2}}\sum_{i}{\hat{m}}_{i}{\widehat{w}}_{ij}+\frac{{k}_{c}-1}{{\hat{c}}_{j}}-\frac{1}{{\theta }_{c}}\ ,$$
(56)

where \({\hat{m}}_{i}\) is the analog of Eq. (17),

$${\hat{m}}_{i}\equiv {x}_{i}-\sum_{j}{\widehat{w}}_{ij}{\hat{c}}_{j}.$$
(57)

Once \({\hat{c}}_{j}\) has converged, we update the weights via

$$\Delta {\widehat{w}}_{ij}\propto \frac{\partial {E}_{t}}{\partial {\widehat{w}}_{ij}}=\frac{1}{{\sigma }_{x}^{2}}{\hat{m}}_{i}{\hat{c}}_{j}.$$
(58)

To prevent divergence of the weights, after each weight update we apply L2 normalization (see Eq. (60b) below).

In summary, on each trial t, the \({\hat{c}}_{j}\,(j=1,2,...,M)\) are first updated,

$${\hat{c}}_{j}^{t}(\tau )={\hat{c}}_{j}^{t}(\tau -1)+{\eta }_{c}\left(\sum_{i}{\hat{m}}_{i}^{t}(\tau -1){\widehat{w}}_{ij}^{t-1}+{\sigma }_{x}^{2}\left[\frac{2}{{\hat{c}}_{j}^{t}(\tau -1)}-\frac{3}{{c}_{{\rm{o}}}}\right]\right),$$
(59)

where the time step τ runs from 0 to 100,000 in each trial. At the end of trial t, the weights are then updated by

$${\widetilde{w}}_{ij}={\widehat{w}}_{ij}^{t-1}+{\eta }_{w}{\hat{m}}_{i}^{t}{\hat{c}}_{j}^{t}$$
(60a)
$${\widehat{w}}_{ij}^{t}=\frac{e}{{c}_{{\rm{o}}}M}\frac{{\widetilde{w}}_{ij}}{\sqrt{{\sum }_{i}{\widetilde{w}}_{ij}^{2}/N}}\ .$$
(60b)

The learning rates, ηc and ηw, were manually tuned. We used ηc = 0.00001 and ηw = 0.5 unless stated otherwise.
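One trial of this baseline can be sketched as follows; the inner loop is truncated relative to the 100,000 steps used in the simulations, the input is a random placeholder, and the small positive floor on \({\hat{c}}_{j}\) is our addition to keep the logarithmic prior term defined.

```python
import numpy as np

N, M, sigma_x, c_o = 40, 10, 0.1, 0.1
eta_c, eta_w = 1e-5, 0.5
rng = np.random.default_rng(5)
w_hat = rng.lognormal(-2.0, 0.1, size=(N, M))
x = rng.normal(0.0, 0.5, size=N)      # placeholder OSN input for one trial

c_hat = np.full(M, c_o)
for _ in range(10_000):               # truncated; the paper uses 100,000 steps
    m_hat = x - w_hat @ c_hat                                   # Eq. (57)
    c_hat = c_hat + eta_c * (w_hat.T @ m_hat
                             + sigma_x**2 * (2.0 / c_hat - 3.0 / c_o))  # Eq. (59)
    c_hat = np.maximum(c_hat, 1e-6)   # small floor (our addition)

# weight update with column-wise L2 normalization, Eqs. (60a)-(60b)
m_hat = x - w_hat @ c_hat
w_tilde = w_hat + eta_w * np.outer(m_hat, c_hat)                # Eq. (60a)
w_hat = (np.e / (c_o * M)) * w_tilde / np.sqrt((w_tilde**2).sum(axis=0) / N)  # Eq. (60b)
```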

Simulation details

The parameters used in the simulations are given in Table 1. Additional details of the simulations, from the implementation of the neural dynamics to the setup of the go/no go task, are provided below.

Table 1 Definitions and values of the parameters.

Implementation of neural dynamics

The M/T cell activity, mi, was defined relative to a baseline, denoted msp; in Fig. 3c, we plotted \({\widetilde{m}}_{i}\equiv {m}_{i}+{m}_{{\rm{sp}}}\). On each trial, mi was initialized to zero and \({\overline{c}}_{j}\) to co: mi(τ = 0) = 0 (i.e., \({\widetilde{m}}_{i}(0)={m}_{{\rm{sp}}}\)) and \({\overline{c}}_{j}(\tau =0)={c}_{{\rm{o}}}\). In addition, the firing rates were lower-bounded by mi ≥ −msp and \({\overline{c}}_{j}\ge 0\).

To avoid numerical instability, Ψ(α) in Eq. (21b) was approximated as

$$1/\Psi (\alpha )\approx \left\{\begin{array}{ll}-\alpha &\,\frac{\alpha }{\sqrt{2}}<-10\\ \frac{\exp (-{\alpha }^{2}/2)}{\sqrt{2\pi }\Phi (\alpha )}&\,-10\le \frac{\alpha }{\sqrt{2}}\le 10\\ 0&\,10<\frac{\alpha }{\sqrt{2}}\ .\end{array}\right.$$
(61)
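In code, this piecewise approximation of 1/Ψ might look as follows; the only external dependency is the standard normal CDF, Φ.

```python
import numpy as np
from scipy.special import ndtr  # Phi, the standard normal CDF

def inv_psi(alpha):
    # 1/Psi(alpha) via the piecewise approximation of Eq. (61)
    a = np.asarray(alpha, dtype=float)
    thr = 10 * np.sqrt(2)
    ac = np.clip(a, -thr, thr)        # keep exp and ndtr in range
    mid = np.exp(-ac**2 / 2) / (np.sqrt(2 * np.pi) * ndtr(ac))
    return np.where(a < -thr, -a, np.where(a > thr, 0.0, mid))

# the middle branch is the exact 1/Psi; the tails use its asymptotes
print(inv_psi(np.array([-20.0, -5.0, 0.0, 5.0, 20.0])))
```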

Implementation of synaptic plasticity

Both the feedforward and lateral weights were initially sampled from a log-normal distribution

$${w}_{ij}^{t = 0}\sim {\mathrm{log}}\,N({\mu }_{g}^{{\rm{init}}},{\sigma }_{g}^{{\rm{init}}})\ ,$$
(62)

with the standard deviation and mean parameters set to

$${\sigma }_{{\rm{{g}}}}^{{\rm{init}}}=0.1$$
(63a)
$${\mu }_{{\mathrm{{g}}}}^{{\rm{init}}}=\frac{1}{2}\left(1-{({\sigma }_{{\mathrm{{g}}}}^{{\rm{init}}})}^{2}\right)-{\mathrm{log}}\,({c}_{{\rm{o}}}M)\ .$$
(63b)

The precision factors, ρj, were initialized as

$${\rho }_{j}^{t = 0}=\frac{{c}_{{\rm{o}}}}{{\sigma }_{x}^{2}{Z}_{\rho }}\ .$$
(64)

We used Zρ = 0.5, except in Fig. 6b and d, where we used Zρ = 0.3. The weights were lower-bounded by zero. As mentioned above, in the simulations we started t from \(t={t}_{\min }\) to suppress the influence of the initial samples. Recurrent inhibition, J, was initialized to Jjk = 0.02 × (1−δjk).
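Collecting Eqs. (62)–(64), the initialization can be sketched as follows; sampling the feedforward and lateral weights independently is an assumption, and the network sizes are illustrative.

```python
import numpy as np

N, M, sigma_x, c_o, Z_rho = 40, 10, 0.1, 0.1, 0.5
rng = np.random.default_rng(6)

sigma_g = 0.1                                           # Eq. (63a)
mu_g = 0.5 * (1 - sigma_g**2) - np.log(c_o * M)         # Eq. (63b)
wF = rng.lognormal(mu_g, sigma_g, size=(M, N))          # Eq. (62), feedforward
wL = rng.lognormal(mu_g, sigma_g, size=(N, M))          # Eq. (62), lateral
rho = np.full(M, c_o / (sigma_x**2 * Z_rho))            # Eq. (64)
J = 0.02 * (1 - np.eye(M))                              # recurrent inhibition
```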

Learning with a fixed gain function

In Fig. 5d, we fixed all λj at 200 (gray) and 342 (black), while the \({\rho }_{j}^{t}\) were updated at each trial as in Eq. (35).

Learning with a fixed learning rate

Fixing the learning rate, \(1/(t{\rho }_{j}^{t})\), to a constant, denoted η, the learning rules for \({w}_{ji}^{{\rm{F}}}\) and \({w}_{ij}^{{\rm{L}}}\) become

$$\begin{array}{rcl}{w}_{ji}^{{\rm{{F}}},t}&=&{w}_{ji}^{{\rm{{F}}},t-1}+\frac{\eta }{{\sigma }_{x}^{2}}{\overline{c}}_{j}\left[{m}_{i}+{\overline{c}}_{j}{w}_{ji}^{{\rm{{F}}},t-1}\right]\\ {w}_{ij}^{{\rm{{L}}},t}&=&{w}_{ij}^{{\rm{{L}}},t-1}+\frac{\eta }{{\sigma }_{x}^{2}}{\overline{c}}_{j}\left[{m}_{i}+{\overline{c}}_{j}{w}_{ij}^{{\rm{{L}}},t-1}\right]\end{array}$$
(65)

and λj is given by

$${\lambda }_{j}=\frac{1}{{\sigma }_{x}^{2}}\left(\sum_{i = 1}^{N}{\left({w}_{ji}^{{\rm{F}}}\right)}^{2}+N\eta \right).$$
(66)

Go/no go task

In the simulation of the go/no go task, we selected two odors (\({j}_{+}\) and \({j}_{-}\)) out of M total odors, then randomly presented one or the other with concentrations drawn from a Gamma distribution (as in Eq. (8), but cj > 0 and co = 1). The reward associated with \({j}_{+}\) was R = 1.0 + ζ (i.e. \({a}_{{j}_{+}}=1.0\)), where ζ is the noise in the observed reward, sampled from a zero-mean Gaussian with variance 0.01. The reward associated with \({j}_{-}\) was R = ζ (i.e. \({a}_{{j}_{-}}=0.0\)).

Learning of the circuit shown in Fig. 7h was done in two steps. First, the weights, \({w}_{ij}^{{\rm{F}}},{w}_{ij}^{{\rm{L}}}\) and \({w}_{ij}^{{\rm{{p}}}}\), and the precisions, ρj and \({\rho }_{j}^{{\rm{{p}}}}\), were learned with the unsupervised learning rules. During this unsupervised period, the reward, R, was kept at zero. After 4000 trials of unsupervised learning, we fixed \({w}_{ij}^{{\rm{F}}},{w}_{ij}^{{\rm{L}}},{w}_{ij}^{p}\), ρj, and \({\rho }_{j}^{{\rm{{p}}}}\), then trained the weights \({\overline{a}}_{j}\) using Eq. (50).

The reward weights for the circuits in both Fig. 7h and i, \({\overline{a}}_{j}\) and hj, respectively, were initialized to zero, and the learning rates were manually tuned to the largest stable rates (ηa = 0.5 and ηh = 0.0015). The latter learning rate was smaller because ∥x∥ is typically much larger than \(\parallel \bar{{\boldsymbol{p}}}\parallel\), and also because the update of the hj was more susceptible to instability.

The classification performance was measured by the probability that the predicted and actual reward were both above 0.5 or both below 0.5,

$${\rm{performance}}\equiv \langle \Theta [({R}_{t}-0.5)(-{\widehat{e}}_{{\rm{{p}}}}-0.5)]\rangle \ ,$$
(67)

where \({\widehat{e}}_{{\rm{{p}}}}\) is the value of ep right before the reward delivery (\({\widehat{e}}_{{\rm{{p}}}}={e}_{{\rm{{p}}}}(\tau =2.45\,{\rm{{s}}})\)). Note that, as mentioned above, \({\widehat{e}}_{{\rm{{p}}}}\) should converge to -Rt. Thus, the average error was defined to be

$$\,\text{Average} \, \text{error}\,\equiv {\left\langle {({R}_{t}+{\widehat{e}}_{{\rm{{p}}}})}^{2}\right\rangle }^{1/2}\ .$$
(68)

Performance evaluation

In the following sections, we summarize the performance evaluation methods employed in this study.

Selectivity of granule cells

Because the network is trained with an unsupervised learning rule, we cannot know which neuron encodes which odor. We thus estimated the selectivity of a neuron from the incoming synaptic weights using a bootstrap method. Specifically, on each trial, the odor o(j) encoded by granule cell j is determined by choosing the odor that yields the maximum covariance between the estimated weights, wF, and the true mixing weight, w,

$$o(j)=\arg \,\mathop{\max }\limits_{m}\sum_{i = 1}^{N}\left({w}_{ji}^{{\rm{{F}}},t}-{\langle {w}_{ji}^{{\rm{{F}}},t}\rangle }_{i}\right)\left({w}_{im}-{\langle {w}_{im}\rangle }_{i}\right).$$
(69)

The selectivity can also be estimated from the activity of a neuron directly, by assuming that the granule cell with the highest activity to odor j codes for odor j. Essentially the same result holds when we take this approach, although accurate readout of selectivity requires a large number of trials. After learning, most neurons learn to encode one odor stably.
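A sketch of the weight-based selectivity assignment, Eq. (69), follows; we read the shapes of wF and w as (M, N) and (N, M), respectively, and the demo inputs are random placeholders.

```python
import numpy as np

def selectivity(wF, w_true):
    # Eq. (69): assign to granule cell j the odor whose true mixing weights
    # have the largest covariance (over glomeruli i) with the learned weights.
    # wF: (M, N), granule <- M/T; w_true: (N, M), glomerulus x odor
    wF_c = wF - wF.mean(axis=1, keepdims=True)       # center over i
    w_c = w_true - w_true.mean(axis=0, keepdims=True)
    return np.argmax(wF_c @ w_c, axis=1)             # o(j) for each cell j

rng = np.random.default_rng(7)
print(selectivity(rng.random((10, 40)), rng.random((40, 10))))
```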

Odor estimation performance

Given the odor selectivity, o(j), the original odors can be reconstructed by

$${\hat{c}}_{j}=\frac{{\sum }_{o(m) = j}{\overline{c}}_{m}}{{\sum }_{o(m) = j}1}\ .$$
(70)

The denominator is the number of neurons that encode odor j, which converges to one after successful learning. If the denominator was zero (i.e., no neuron encoded odor j), we set \({\hat{c}}_{j}\) to 0. Performance was defined to be the correlation between the estimated odor concentration, \({\hat{c}}_{j}\), and the true concentration, cj. Evaluation of performance on trial t used o(j) calculated from wF,t−1, not from wF,t. In Fig. 7e and f, we instead calculated the correlation between \({\widehat{p}}_{j}\) and the true value of Θ[cj] using the same method, where

$${\widehat{p}}_{j}=\frac{{\sum }_{{o}_{p}(m) = j}{\overline{p}}_{m}}{{\sum }_{{o}_{p}(m) = j}1}\ ,$$
(71)

with \({o}_{p}(j)\) the piriform neuron selectivity.
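A sketch of the reconstruction step, Eq. (70), is given below (Eq. (71) is identical with \({\overline{p}}_{j}\) and op in place of \({\overline{c}}_{j}\) and o); the demo assignments are hypothetical.

```python
import numpy as np

def reconstruct(c_bar, o):
    # Eq. (70): average the activity of all cells m assigned to odor j;
    # zero if no cell encodes j. c_bar: (M,) cell activities, o: (M,) o(m)
    c_est = np.zeros(len(c_bar))
    for j in range(len(c_bar)):
        mask = o == j
        if mask.any():
            c_est[j] = c_bar[mask].mean()
    return c_est

o = np.array([1, 1, 0, 2, 3, 4, 5, 6, 7, 8])   # hypothetical assignments
print(reconstruct(np.linspace(0.0, 0.9, 10), o))
# performance = np.corrcoef(c_est, c_true)[0, 1]
```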

ROC curve

We calculated the generalized ROC curves as in Fig. 7 of Grabska-Barwińska et al. (2017)6 using \({\hat{c}}_{j}\). We first separated the trials based on the total number of odors presented, and then for each trial we calculated the number of true/false positives under various thresholds θth. The true positive fraction is the fraction of presented odors above the threshold, θth, whereas the false positive count is the number of absent odors above the same threshold. The threshold, θth, was varied from 10−6 to 101 on a log scale, increasing by roughly 20% per step.
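For a single trial, the threshold sweep might look as follows; the number of log-spaced steps is chosen so that consecutive thresholds differ by roughly 20%, and the demo inputs are placeholders.

```python
import numpy as np

def roc_points(c_est, c_true, n_steps=89):
    # thresholds from 1e-6 to 1e1; 89 log-spaced points give a ~20%
    # increase per step
    thresholds = np.logspace(-6, 1, n_steps)
    present = c_true > 0
    tp_frac = np.array([(c_est[present] > th).mean() for th in thresholds])
    fp_count = np.array([(c_est[~present] > th).sum() for th in thresholds])
    return tp_frac, fp_count

rng = np.random.default_rng(9)
c_true = np.zeros(10); c_true[:3] = 1.0
tp, fp = roc_points(c_true + 0.05 * rng.random(10), c_true)
```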

Weight error

Given o(j), the error between the learned feedforward weight, \({w}_{ij}^{{\rm{F}}}\), and the true mixing weight, wij, was calculated by

$${d}_{w}^{{\rm{{F}}},t}\equiv \frac{1}{M}\sum_{j}^{M}\sqrt{\frac{1}{N}\sum_{i}^{N}{\left({w}_{ji}^{{\rm{{F}}},t}/{Z}_{j,t}^{w}-{w}_{i,o(j)}\right)}^{2}},$$
(72)

where \({Z}_{j,t}^{w}={\sum }_{i}{w}_{ji}^{{\rm{{F}}},t}/{\sum }_{i}{w}_{i,o(j)}\). For ease of comparison, in Fig. 6b the weight errors were scaled by 7/3, so that the initial error was similar to the errors shown in Fig. 6a.

Lifetime sparseness

For the measurement of the lifetime sparseness19, we first presented individual odors m = 1, 2, . . . , M, then recorded the activity of granule cells \(\{{\overline{c}}_{j}^{(m)}\}\). Subsequently, we calculated the sparseness using

$${S}_{j}\equiv \frac{{\left(\frac{1}{M}\mathop{\sum }\nolimits_{m = 1}^{M}{\overline{c}}_{j}^{(m)}\right)}^{2}}{\frac{1}{M}\mathop{\sum }\nolimits_{m = 1}^{M}{\left({\overline{c}}_{j}^{(m)}\right)}^{2}}.$$
(73)

The lifetime sparseness, Sj, takes a small value (Sj ≃ 0) if the activity is sparse, and approaches its upper bound, Sj = 1, if the activity is uniform across odors. Because of this, the lifetime sparseness is sometimes defined as \({\widetilde{S}}_{j}\equiv 1-{S}_{j}\)53.
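A sketch of this computation, applied to a matrix of single-odor responses, is given below; the small floor in the denominator is our addition to avoid division by zero for silent cells, and the demo responses are random placeholders.

```python
import numpy as np

def lifetime_sparseness(C):
    # Eq. (73). C: (M_odors, n_cells); C[m, j] is cell j's response to
    # odor m presented alone. Returns S_j per cell.
    mean_sq = C.mean(axis=0)**2
    sq_mean = np.maximum((C**2).mean(axis=0), 1e-12)  # floor (our addition)
    return mean_sq / sq_mean

rng = np.random.default_rng(8)
C = rng.random((10, 50))**8     # sparse-ish responses
print(lifetime_sparseness(C)[:5])
```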

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.