Introduction

Although the most popular approach for training deep neural networks is supervised learning [1], the first demonstration of the potential of deep learning stemmed from the discovery of efficient unsupervised learning methods for stochastic neural networks known as Deep Belief Networks (DBNs) [2, 3]. Since their introduction, DBNs have been successfully used in many challenging tasks, ranging from computer vision [4] to acoustic modeling [5], traffic flow prediction [6] and breast cancer classification [7]. These energy-based models have some unique properties compared to other unsupervised deep learning approaches, such as the ability to represent compositional structure [8, 9] and the possibility of being interpreted in terms of well-established theoretical principles rooted in statistical physics [10].

The capability of learning deep generative models using Hebbian-like mechanisms also makes DBNs particularly relevant for cognitive modeling research [11]. Indeed, this class of models offers a principled account of the functional role of top-down processing supported by feedback loops, while also providing a bridge to higher-level descriptions of cognition in terms of Bayesian computations [12, 13]. Compared to shallow generative models, hierarchical generative networks make it possible to study the emergence of increasingly complex representations of the sensory signal, and thus to simulate a wide range of high-level perceptual and cognitive functions, such as numerosity perception [14,15,16], letter perception [17, 18], orthographic processing [19], development of object-action associations [20] and the appearance of visual hallucinations caused by damage in cortical areas [21]. Sparse variants of DBNs have also been used to simulate physiological properties of neurons in the primary and secondary visual cortex [22]. Notably, inference algorithms for DBNs can be implemented using biologically realistic sampling schemes, which explain unique aspects of low-level brain dynamics [23] and can be efficiently reproduced in spiking models [24]. Finally, hierarchical generative models like the DBN offer important insights into the functional role of spontaneous brain activity, both in terms of top-down predictive signals during task execution and in terms of generative replays during rest [25].

However, learning in DBNs has traditionally relied on a greedy, layer-wise training approach: the connection weights of layer \(n\) are changed only after layer \(n - 1\) has been fully trained. Moreover, the trained weights are frozen and do not change further while learning takes place at higher layers. Although efficient from a computational perspective, this learning modality is clearly implausible from a biological standpoint. Indeed, the greedy approach implies that information is not passed to any higher-order network until learning at the lower level can be stopped because it has reached a (somewhat arbitrary) criterion. Though brain development shows variability across regions and may peak at different times, synaptogenesis begins at about the same time in distant regions such as the visual and the prefrontal cortex [26]. Moreover, a substantial degree of plasticity is preserved in adulthood even in the visual cortex (see [27], for review). Finally, spontaneous brain activity, which is thought to be a manifestation of top-down dynamics of generative models [25], is already structured into distinct cortical networks at birth [28].

Besides these neurobiological considerations, another key limitation of the greedy training approach is that it makes the DBN unsuitable for modeling human development. Neural network models are particularly attractive for understanding developmental phenomena [29] because the trajectories in task performance or in the emergence of internal representations can be examined during learning and compared to human empirical data. Moreover, learning trajectories in neural network models can be analyzed as a function of initial starting conditions to study the emergence of developmental disorders [30]. In the cognitive modeling literature, however, simulations based on DBNs have focused on adult performance (i.e., fully trained models), and developmental investigations based on unsupervised learning have circumvented this problem by adopting alternative learning paradigms based on deep autoencoders [31] trained with error backpropagation [32].

In this work we propose a novel learning scheme for tuning the entire hierarchy of connections in a DBN iteratively (hereafter, iDBN), using a variant of the original Contrastive Divergence (CD) learning algorithm [33]. Through extensive simulations on two different sets of visual stimuli, we demonstrate that the proposed approach achieves the same final accuracy as its greedy counterpart, while allowing precise tracking of the developmental trajectory of the models. We further show that an alternative developmental scheme based on full-stack propagation of top-down information does not converge to an optimal solution, suggesting that recurrent processing between adjacent layers is a key ingredient for successfully driving learning through local signals. In a first set of simulations we rely on the popular MNIST data set of handwritten digits [34], with the goal of validating the proposed iterative scheme on a well-known benchmark. In this setup, we also show that our developmental learning scheme can be extended to continual learning scenarios [35], where the model exploits interleaved learning to incorporate knowledge from another domain (in our case, handwritten letters) without incurring catastrophic forgetting [36, 37]. Furthermore, we carry out graph theoretical analyses to investigate how structural properties of the network gradually emerge during learning. We finally consider a more recent data set consisting of images containing a variable number of items, which has been used to investigate the perception of visual numerosity in humans and machines [14, 16], to demonstrate how the iDBN can be used to simulate developmental trajectories in the acquisition of cognitive skills.

Methods

In this section we will briefly review the theoretical foundations of deep belief networks and describe their classical, greedy learning algorithm. We will then introduce our iterative learning approach and describe the materials and methods used in our simulations.

Deep Belief Networks

The building block of a DBN is the Restricted Boltzmann Machine (RBM [38]; for a recent overview see [39]), which is a bipartite network composed of two separate sets of neurons: visible neurons, which constitute the interface with the sensory environment and are thus usually clamped to the input data, and hidden (equivalently called ‘latent’) neurons, which capture high-order correlations in the data distribution. Learning in RBMs consists of discovering a set of latent features that compactly describe the statistical regularities in the data distribution, by creating an internal model of the environment that can be used to generate plausible activation patterns in the visible neurons. The lack of connections between neurons in the same layer makes it easy to compute the data-dependent and model-dependent statistics used in the Contrastive Divergence (CD) algorithm [33], because units in the same layer are conditionally independent given the activation of the other layer. Deep belief networks are created by stacking together several RBMs [2] (see Fig. 1a), thus exploiting the hierarchical composition of the features learned by the individual RBMs [40].
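To make the learning step concrete, the following minimal numpy sketch implements CD-1 for a single RBM with binary stochastic units; biases, weight decay and momentum are omitted for readability, so this is a simplified illustration rather than the exact routine used in our simulations.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(W, v0, lr=0.01):
    """One CD-1 step for a single RBM (weights only; biases omitted for brevity).

    W  : weight matrix of shape (n_visible, n_hidden)
    v0 : minibatch of data clamped on the visible layer, shape (batch, n_visible)
    """
    # Recognition (bottom-up): infer hidden probabilities and sample binary states
    h0_prob = sigmoid(v0 @ W)
    h0 = (np.random.rand(*h0_prob.shape) < h0_prob).astype(float)

    # Generation (top-down): reconstruct the visible layer, then re-infer the hiddens
    v1_prob = sigmoid(h0 @ W.T)
    h1_prob = sigmoid(v1_prob @ W)

    # Contrastive Divergence: data-dependent minus model-dependent statistics
    positive = v0.T @ h0_prob
    negative = v1_prob.T @ h1_prob
    W += lr * (positive - negative) / v0.shape[0]
    return W, h0_prob
```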

Fig. 1

Graphical representation of the architecture of a 3-layer Deep Belief Network and the learning schemes implemented in the present work. Green arrows represent bottom-up recognition connections, while red arrows represent top-down generative processing. Yellow boxes enclose local computations. We consider the case of CD-1; CD-k can be recovered by repeating the sampling steps k times. \(v \sim D\) identifies a data instance sampled from the training set and \(\varvec{h}_i\) represents the hidden activities of layer i. In the greedy scheme (a), hidden layers are trained sequentially, from bottom to top, and input signals are never projected into layer \(l\) unless learning at layer \(l - 1\) is completed. In the iterative scheme (b), input signals are immediately propagated through the entire deep network, and top-down processing is performed locally at each layer to jointly learn all connection weights. In the full-stack scheme (c), both feed-forward propagation and top-down processing occur over the entire deep network

Greedy Layer-Wise Learning

For the sake of our argument, it is useful to make an explicit distinction between bottom-up recognition connections and top-down generative processing (represented by green and red arrows, respectively, in Fig. 1). During the recognition (also called “inference”) phase, the sensory pattern is clamped on the visible neurons, and the hidden neurons are activated in a bottom-up fashion in order to infer the most likely configuration (i.e., activation pattern over the hidden neurons) that could have produced the observed data. During the generation phase, the visible neurons are not clamped to the data, and all neurons are activated in a top-down fashion in order to produce a plausible activation pattern, that is, to generate a sample from the internal model.

In the classical approach [2], the RBM constituting the bottom layer of the DBN is initially trained using the input data. Once this first layer has been fully trained, the weights of the first RBM are “frozen” and the input data set is projected into the activation space of the hidden neurons, thus creating a new training set that is used as input for the second RBM (black arrows in Fig. 1a). Once training of the second layer is completed, the data are likewise projected to create the training set for the third layer; the procedure is repeated for all the layers constituting the DBN, as shown in Fig. 1a. Note that the data patterns projected into the deeper layers are created by computing the conditional probability distribution of the hidden neurons given the activation of the neurons observed in the layer below [41].
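For later comparison with the iterative scheme, the greedy procedure can be sketched as follows, reusing the cd1_update helper from the previous sketch; minibatching and the per-layer convergence criterion are simplified to a fixed number of epochs.

```python
def train_greedy(weights, data, epochs_per_layer=50, lr=0.01):
    """Greedy layer-wise training: each RBM is fully trained and frozen
    before its hidden activations become the training set of the next one.

    weights : list of weight matrices, one per RBM in the stack
    data    : array of training patterns clamped on the visible layer
    """
    layer_input = data
    for W in weights:
        for epoch in range(epochs_per_layer):
            cd1_update(W, layer_input, lr)
        # Project the training set through the (now frozen) layer to build
        # the input of the next RBM
        layer_input = sigmoid(layer_input @ W)
    return weights
```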

Iterative Joint Learning

In contrast to the greedy approach, the proposed iDBN learning algorithm attempts to update at once all the parameters of the hierarchical generative model, regardless of the depth of the respective layer (see Fig. 1b).

As noted before, in the greedy algorithm deeper layers are trained sequentially, using as input the projection of the data through the weights resulting from learning in the previous layers. In our novel iterative algorithm, instead, the training patterns for the deeper layers are immediately created following each sensory experience, by propagating the input across the entire processing hierarchy (green arrows in Fig. 1b). This process mimics the fast feed-forward sweep observed in cortical circuits, where neuronal activity is rapidly routed to a large number of visual areas after stimulus presentation [42, 43]. It should be noted that in our algorithm the complete feed-forward sweep occurs even during the initial learning phase, when all the connection weights are randomly initialized, which makes learning of the subsequent layers challenging since the data distribution (i.e., hidden unit activation) is non-stationary. Concurrently with the fast feed-forward sweep, top-down generative connections are locally used to reconstruct the data representations at each level of the hierarchy (red arrows in Fig. 1b), mimicking the kind of processing supported by recurrent and horizontal connections within cortical areas [42, 44]. A schematic pseudo-code that illustrates the implementation details of our iterative algorithm is reported below.

(Pseudo-code of the iterative joint learning algorithm.)
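A minimal Python sketch of this scheme is given below: every minibatch is propagated through the whole stack in a single sweep, with a local CD-1 update (bottom-up recognition plus local top-down reconstruction) performed at each layer. Biases, momentum, weight decay and other implementation details are omitted, so it should be read as an illustration of the scheme rather than our exact implementation.

```python
def train_iterative(weights, data_loader, epochs=50, lr=0.01):
    """Iterative joint learning (iDBN): all layers are updated after every minibatch.

    weights     : list of weight matrices, one per layer of the DBN
    data_loader : iterable yielding minibatches of sensory patterns
    """
    for epoch in range(epochs):
        for batch in data_loader:
            layer_input = batch
            for W in weights:
                # Local CD-1 step: recognition followed by top-down
                # reconstruction confined to this pair of adjacent layers
                _, hidden_prob = cd1_update(W, layer_input, lr)
                # Fast feed-forward sweep: the (still evolving) hidden
                # activities immediately become the input of the next layer
                layer_input = hidden_prob
    return weights
```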

Full-Stack Joint Learning

An alternative way to jointly learn all weights of a DBN could be to first propagate the input across the entire processing hierarchy (as in the first phase of iDBN) and then produce top-down reconstructions starting from the deepest layer back to the sensory layer (see Fig. 1c). This processing scheme is simpler to implement, but suffers from the vanishing gradient problem encountered in standard deep learning settings [45]. We use this full-stack developmental scheme as a benchmark for our iterative developmental scheme.

Simulations

MNIST Data Set

The same DBN structure is used and held fixed during the simulations for both greedy and iterative learning. In order to allow for a fair comparison, we maintained the same architecture and hyper-parameters of the original model [2], which was composed of one visible layer with 784 neurons, two hidden layers with 500 neurons each and a final hidden layer with 2000 neurons (see Fig. 1a). Connection weights are initialized with random values sampled from a Normal distribution N(0, 0.01), while biases are initialized to zero. We also tested an alternative initialization scheme known as the Glorot initialization [47], in order to evaluate learning convergence under more advanced weight initialization strategies. The weight matrices initialized with the Glorot scheme are multiplied by a factor of 0.1 to make them compatible with the range observed in the Normal initialization. The learning rate is set to \(\lambda = 0.01\) and the weight decay coefficient is set to \(\alpha = 0.0001\). The momentum parameter \(\nu\) is set to 0.5 in the initial learning stage and then updated to 0.9 after 5 epochs of training. The loss is minimized using standard stochastic gradient descent, implemented through CD-1 learning. The model is trained for 50 epochs. Further tests, using both Normal and Glorot initialization schemes, also included dropout regularization [48] with the probability of unit presence set to \(p=0.1\).
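The two initialization schemes can be summarized in a few lines of numpy; the Glorot scheme is shown here in its uniform variant, rescaled by the 0.1 factor mentioned above (function names are ours, and the exact variant used may differ).

```python
import numpy as np

def init_normal(n_in, n_out, std=0.01):
    """Normal initialization: small zero-mean Gaussian weights, zero biases."""
    return np.random.normal(0.0, std, size=(n_in, n_out)), np.zeros(n_out)

def init_glorot_scaled(n_in, n_out, scale=0.1):
    """Glorot (uniform) initialization, rescaled by 0.1 to match the range
    observed with the Normal scheme."""
    limit = np.sqrt(6.0 / (n_in + n_out))
    W = scale * np.random.uniform(-limit, limit, size=(n_in, n_out))
    return W, np.zeros(n_out)

# Architecture used for the MNIST simulations: 784 - 500 - 500 - 2000
layer_sizes = [784, 500, 500, 2000]
weights = [init_normal(n_in, n_out)[0]
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
```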

The first quantity inspected to evaluate the quality of the learned models is the accuracy of a linear readout at each hidden layer: since in DBNs the input patterns are non-linearly transformed from one layer to the next, the internal representations are expected to become more linearly separable as we move up in the hierarchy [11]. The accuracy of a simple Ridge classifier is used as a measure of separability.
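A readout of this kind can be computed, for instance, with scikit-learn’s ridge classifier applied to the hidden activations of a given layer (a sketch assuming the activations have already been extracted):

```python
from sklearn.linear_model import RidgeClassifier

def readout_accuracy(train_h, train_labels, test_h, test_labels, alpha=1.0):
    """Linear separability of a layer's representations, measured as the
    test accuracy of a Ridge classifier trained on its hidden activations."""
    clf = RidgeClassifier(alpha=alpha)
    clf.fit(train_h, train_labels)
    return clf.score(test_h, test_labels)
```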

The performance of the trained models is then assessed in three image generation tasks: 1) reproduction of clean images, 2) completion of partially occluded images and 3) denoising of noise-corrupted images. In the second case, an arbitrary number of consecutive rows in the image matrix are set to zero (i.e., turned to black). In the third case, all values \(\{X_{ij}\}_{i,j = 1}^{28}\) (28 being the number of pixels per image side) are corrupted by adding Gaussian noise, that is \(X_{ij} \leftarrow X_{ij} + \epsilon\), \(\epsilon \sim N(0, 0.5)\). Images are fed to the visible layer of the DBN, propagated through all its hidden layers and then fed back to the visible layer through feedback connections. In this way the noisy / corrupted samples can be adjusted according to the internal model learned by the DBN. Original samples and the corresponding reconstructions are quantitatively compared using the mean squared error (hereafter MSE) between input patterns and reconstructions. For each case, the error is computed as \(\Vert X^{0} - X^{\text {r}} \Vert\), where the superscript 0 denotes the original sample, r indicates that the image has been reconstructed (reproduced, completed or denoised) and \(\Vert \cdot \Vert\) denotes an \(L^2\) norm. Model performance is averaged over 10 model runs with different random initializations in order to assess the robustness of the analyses. We also visualize the receptive fields of neurons at different hidden layers to qualitatively assess the type of features (i.e., internal representations) learned by the DBNs.
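The two corruption procedures and the reconstruction error can be sketched as follows; the images are assumed to be flattened 28 × 28 arrays, and the routine that propagates a pattern up and down the DBN is not shown.

```python
import numpy as np

def occlude_rows(images, n_rows=10):
    """Set a block of consecutive rows to zero (black) in each 28x28 image."""
    corrupted = images.reshape(-1, 28, 28).copy()
    start = np.random.randint(0, 28 - n_rows + 1)
    corrupted[:, start:start + n_rows, :] = 0.0
    return corrupted.reshape(images.shape)

def add_gaussian_noise(images, sigma=0.5):
    """Additive Gaussian noise on every pixel: X_ij <- X_ij + eps, eps ~ N(0, sigma)."""
    return images + np.random.normal(0.0, sigma, size=images.shape)

def reconstruction_error(originals, reconstructions):
    """Mean squared (L2) distance between original and reconstructed patterns."""
    return np.mean(np.sum((originals - reconstructions) ** 2, axis=1))
```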

Continual Learning

To further support the cognitive validity of our approach, we investigate whether the proposed developmental algorithm can effectively deal with a challenging continual learning scenario, in which the DBN should learn a generative model of data distributions provided incrementally, without forgetting knowledge obtained during the preceding stages. In such a scenario it is well known that neural networks suffer from catastrophic forgetting, whereby knowledge learned during subsequent stages completely disrupts previously learned information (for review, see [37]). Several solutions have been proposed to mitigate this issue, and here we specifically consider one that can be readily implemented within our framework: interleaved learning [49]. We consider a somewhat simplified version of interleaved learning, where unsupervised learning during a second training phase takes advantage of both patterns from the new target distribution and patterns belonging to the previous data set. More advanced implementations could exploit deep generative replay [50] to directly sample previously learned patterns from the hierarchical generative model, making it possible to build the mixed data set in a more data-efficient way.

Following the first stage of unsupervised learning on handwritten digits, the DBN is exposed to a subset of handwritten letters from the EMNIST data set [51]. In one setup the DBN is trained sequentially, which means that only letter patterns are used to train the network during the second stage. In the interleaved setup, instead, during the second stage the unsupervised training set includes both digits and letters. In order to balance the number of patterns and output classes, we randomly sample 20,000 digits from the MNIST data set and we evenly sample 20,000 uppercase letters from the first 10 EMNIST classes. Continual learning performance is probed by monitoring both digit recognition accuracy (using the readout classifier trained during the first learning stage) and letter recognition accuracy (training a new readout classifier on the EMNIST patterns).
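A sketch of how the interleaved training set for the second stage can be assembled; the two image arrays are assumed to have been loaded beforehand, and the per-class balancing of the EMNIST letters is omitted for brevity.

```python
import numpy as np

def build_interleaved_set(digit_images, letter_images, n_per_set=20000):
    """Mix previously seen digits with new uppercase letters for interleaved learning."""
    digit_idx = np.random.choice(len(digit_images), n_per_set, replace=False)
    letter_idx = np.random.choice(len(letter_images), n_per_set, replace=False)
    mixed = np.concatenate([digit_images[digit_idx], letter_images[letter_idx]])
    np.random.shuffle(mixed)
    return mixed
```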

Graph Analysis

Recent work has shown that graph theory can be successfully used to study deep learning models from a network science perspective [52, 53]. In order to investigate how topological properties of the graph derived from the DBN might emerge during the course of unsupervised learning, we thus performed a graph analysis on the structure of the deep network during learning over the MNIST data set.

From the trained DBN we extract the weight matrices and define a graph having the same architecture and connection weights as the DBN. A methodological difficulty is posed by the continuity of the connection weights: in order to determine the node degrees, we thus prune the network by binarizing the connections according to a suitably chosen cutoff threshold. A sensible choice of the cutoff threshold is a value that removes redundant connections while keeping those that contribute most to signal propagation through the network. Note that the notions of “redundant” and “important” connections are largely arbitrary and not obvious to assess. For the sake of the structural analysis, however, this choice is based on the numerical magnitude of the connection strengths. The range of thresholds has been set to \(\{0.2, 0.4, 0.6, 0.8, 1, 1.25, 1.5\}\), and the structural analysis has been performed for each of these values. A cutoff threshold of c discards the weights in the interval \([-c, c]\), while those outside the interval are kept for the network analysis.
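A sketch of the binarization step using networkx, assuming the list of weight matrices has been extracted from the DBN; a connection survives only if its absolute strength exceeds the cutoff. Global properties such as the mean degree or the number of connected components can then be read off the resulting graph with standard networkx functions.

```python
import networkx as nx
import numpy as np

def dbn_to_graph(weights, cutoff=0.4):
    """Build an undirected graph from a stack of weight matrices, keeping
    only connections whose absolute strength exceeds the cutoff threshold."""
    graph = nx.Graph()
    offset = 0
    for W in weights:
        n_in, n_out = W.shape
        # All neurons of the two adjacent layers become nodes (possibly isolated)
        graph.add_nodes_from(range(offset, offset + n_in + n_out))
        rows, cols = np.where(np.abs(W) > cutoff)
        graph.add_edges_from((offset + r, offset + n_in + c)
                             for r, c in zip(rows, cols))
        offset += n_in
    return graph
```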

As customary in network science, the structural properties of the “real” networks (DBNs) are compared with the same properties observed in a random counterpart. A random “replica” of the network is generated according to its architectural characteristics (such as the number of nodes) and its local properties (e.g., mean node degree or edge probability). The process is composed of two main steps: 1) generate a random replica of the real network and 2) perform the structural analyses on both instances. In our case, the real network was compared with an analogous binomial graph generated using the probability of edge existence, computed as the ratio between the number of actual edges and the maximum number of possible edges (which depends on the number of node pairs). This random replica is generated in such a way as to have the same architecture as the DBN; in particular, it is a stack of bipartite binomial graphs with the same number of nodes as the DBN layers. Due to this architecture, node connectivity is constrained to adjacent layers: for example, a node in the first layer cannot be connected to a node in the third layer or above. Such a graph structure poses a problem for the characterization of the node degree distribution, which is used in network science to evaluate whether a graph is random or derived from a real network. The latter (be they natural, biological, technological or social networks) typically have a degree distribution well described by a power-law [54], while random graphs have a degree distribution that depends on the method with which they are generated. For example, binomial random graphs have a degree distribution that follows, by design, the Binomial distribution. In our setup, the constrained node connectivity would introduce this architectural bias into the normalization of the degree distribution. To mitigate this effect, we chose to weight the degree of each node by its maximum potential degree, as described in detail in Appendix C.
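A binomial replica with matching architecture can be generated, for instance, as a stack of bipartite random graphs, one per pair of adjacent layers, each using the empirical edge probability of the corresponding layer pair (a sketch; the degree-weighting correction of Appendix C is not included).

```python
import networkx as nx
from networkx.algorithms import bipartite

def binomial_replica(layer_pairs):
    """Stack of bipartite binomial graphs matching the layer sizes and
    edge densities of the thresholded DBN graph.

    layer_pairs : list of (n_bottom, n_top, n_edges) tuples, one per pair
                  of adjacent layers of the real (thresholded) network
    """
    replica = nx.Graph()
    offset = 0
    for n_bottom, n_top, n_edges in layer_pairs:
        p_edge = n_edges / float(n_bottom * n_top)  # empirical edge probability
        pair_graph = bipartite.random_graph(n_bottom, n_top, p_edge)
        # Shift node labels so that the top nodes of this pair coincide with
        # the bottom nodes of the next one
        pair_graph = nx.relabel_nodes(pair_graph, {n: n + offset for n in pair_graph})
        replica = nx.compose(replica, pair_graph)
        offset += n_bottom
    return replica
```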

Numerosity Data Set

The “numerosity” data set was first introduced by Stoianov and Zorzi [14], who demonstrated that the approximate number of objects in a visual scene can be estimated by a hierarchical processing architecture that learns to extract increasingly more abstract representations from the sensory input in a completely unsupervised way (sample images are provided in Fig. 7 in Appendix A). Here we focus on the development of “number acuity” in the network, which has recently been investigated using a developmental approach based on deep autoencoders [31]. Number acuity can be measured using a numerosity discrimination task, where the network is asked to classify any image as containing a larger or smaller number of objects with respect to a given reference number. Also in this case, to ensure a fair comparison with the original model [14], we considered the same model architecture and task. The DBN has a visible layer of 900 neurons, while the first and the second hidden layers have 80 and 400 neurons, respectively. We adopted a Normal initialization scheme, where weights are initialized with random values sampled from N(0, 0.1). The learning rate and weight decay are set to \(\lambda = 0.1\) and \(\alpha = 0.0002\), respectively, and the initial and final momentum are set to \(\nu = 0.5\) and 0.9, again with a momentum switch at epoch 5. The model is trained for 100 epochs.

To assess the number acuity of the network (also known as the “internal Weber fraction”), a linear classifier is applied to the deepest layer of the model, with the goal of establishing whether the hidden neurons’ activation corresponds to an input image containing a numerosity larger than a reference number (i.e., 8 or 16). This task becomes trivial when the difference between the given numerosity \(n_i\) and the reference number \(N_{\text {ref}}\) is large: for example, it is easy to tell which numerosity is smaller between 4 and 16, but harder between 15 and 16. For this reason, each reference numerosity has an associated window of numerosities used for the comparison: \(\{5,\dots , 12\}\) for \(N_{\text {ref}} = 8\) and \(\{10,\dots , 24\}\) for \(N_{\text {ref}} = 16\), so that the ratios \(r_i = n_i / N_{\text {ref}}\) yield the range [0.65, 1.25]. The percentage of correct classifications is analyzed as a function of these numerical ratios: each value \(r_i\) is associated with a percentage of correct classifications \(y_i\). This ensemble of points (r, y) is used to fit a sigmoid-shaped psychometric function, defined by

$$\begin{aligned} y = 1 - \Phi \left( \mu = r,\ \sigma = \sqrt{2}\,w \right) \end{aligned}$$
(1)

where \(\Phi\) is the cumulative distribution function of the Normal distribution and \(w\) is the Weber fraction.
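A sketch of how the Weber fraction can be estimated from the (r, y) points with scipy, reading Eq. 1 as the Normal CDF with mean r and standard deviation \(\sqrt{2}\,w\) evaluated at the reference ratio of 1; both this reading and the fitting routine are our own assumptions about the implementation details.

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import norm

def psychometric(r, w):
    """Eq. 1, read as a Normal CDF with mean r and s.d. sqrt(2)*w,
    evaluated at the (normalized) reference ratio of 1 (an assumption)."""
    return 1.0 - norm.cdf(1.0, loc=r, scale=np.sqrt(2.0) * w)

def fit_weber_fraction(ratios, y_values, w0=0.2):
    """Fit the Weber fraction w to the (ratio, classification rate) points."""
    (w_hat,), _ = curve_fit(psychometric, ratios, y_values, p0=[w0])
    return w_hat
```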

Our interest is to compare the progressive refinement of the Weber fraction during the unsupervised learning phase. The final performance can be directly compared with the Weber fraction achieved by a DBN trained using the greedy scheme [14], while the developmental trajectories can be compared to the learning curves recently reported for deep autoencoders [31].

Results

MNIST Data Set

Readout Accuracy and Reconstruction Error Trends

Results discussed in this section only refer to the DBN configuration with Normal initialization and no dropout, since it turned out that differences with Glorot initialization and inclusion of dropout are negligible. The reader may refer to Appendix B for the complete results.

Figure 2a shows that the readout accuracy increases with depth, suggesting that during the course of unsupervised learning the internal representations become more disentangled (i.e., linearly separable). The reconstruction errors (MSE at each hidden layer) decrease monotonically and eventually converge (Fig. 2b). Not surprisingly, results related to the first layer are almost perfectly overlapping for the greedy and iterative learning schemes: the activation of the visible layer corresponds to the raw data in both cases, thus the learned weights do not depend on the training modality. For the second and third layers, instead, the greedy scheme achieves higher readout accuracy in fewer epochs: however, this effect is due to the fact that the previous layers have already been completely trained, thus providing a head-start for the upper layers. At the end of training, the final accuracy is indistinguishable. Notably, the alternative developmental variant based on full-stack propagation does not exhibit the same optimal convergence, as highlighted by the poor reconstruction error measured in the first hidden layer (see Fig. 8 in Appendix A).

Fig. 2

Performance of the greedy vs. iterative learning schemes during learning (top and middle panels) and at the end of the unsupervised learning phase (bottom panel). For the latter case we also report the generation capabilities of the alternative developmental scheme based on full-stack propagation

Generative Capabilities

As shown in Fig. 2c, at the end of the learning phase the greedy and iterative DBNs achieve equivalent generative capabilities (samples of generated images are provided in Fig. 9 in Appendix A). The completion task appears to be the most challenging, probably because when entire regions of the images are corrupted it is difficult to generate plausible completions. The full-stack developmental version does not converge to a satisfactory generative model, as highlighted by its poor performance in all generation tasks.

Emergent Internal Representations

Receptive fields are useful to qualitatively inspect what kind of features are learned by the different layers of the hierarchical model during training. Such inspection is done by visualizing the connection weights of a given neuron in the input space [11]. The first hidden layer is easy to inspect, since we just need to plot the weight matrix of the first layer \(W_1\). For neurons of the second hidden layer it is necessary to compose the weight matrices, so that the visualization can still be represented in terms of the visible layer dimension. The weight matrices are simply multiplied, thus producing a linear combination: we can plot chosen rows of the product \(W_2 W_1\) to look at the receptive fields of the second layer neurons, \(W_3 W_2 W_1\) for the third layer, and so forth. As shown in Fig. 10 in Appendix A, the greedy and iterative learning schemes developed qualitatively similar receptive fields across the entire processing hierarchy.
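A sketch of this visualization with matplotlib; with weight matrices stored in the (input, output) orientation used in the earlier sketches, the product is taken in the reverse order and the receptive fields are the columns of \(W_1 W_2 W_3\) (equivalently, the rows of \(W_3 W_2 W_1\) when matrices are stored as hidden × visible, as in the text).

```python
import matplotlib.pyplot as plt

def plot_receptive_fields(weights, n_units=25, img_side=28):
    """Project hidden units of the deepest layer in `weights` back into pixel
    space by composing the weight matrices, then display them as images."""
    proj = weights[0]
    for W in weights[1:]:
        proj = proj @ W          # linear combination of lower-level features
    fig, axes = plt.subplots(5, 5, figsize=(6, 6))
    for unit, ax in enumerate(axes.ravel()[:n_units]):
        ax.imshow(proj[:, unit].reshape(img_side, img_side), cmap="gray")
        ax.axis("off")
    plt.show()
```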

Resilience to Catastrophic Interference

As clearly shown in Fig. 3, the sequential learning setup (dashed lines) is dramatically affected by catastrophic interference: while the readout accuracy on the new letter data set increases, the performance of the classifier trained on the previous digits data set steadily drops as learning proceeds. On the contrary, the interleaved learning setup (solid lines) easily incorporates knowledge from the new letter data set, achieving the same accuracy as the sequential setup while maintaining previously learned knowledge, as demonstrated by the preservation of the readout accuracy for the digit data set.

Fig. 3

Readout accuracy in the continual learning scenario. The sequential learning regimen is strongly affected by catastrophic forgetting, while interleaved learning incorporates information from the new distribution (Letters) while preserving previous knowledge (Digits). Classifiers are evaluated at 10 regularly spaced intervals during each unsupervised learning epoch

Emergence of Structural Properties During Learning

Typically, inspecting the degree distribution of a given network gives a first idea of the nature of the system, at least to determine whether the network is random or real. Here, our main focus is on how the structural properties (among others, the mean degree) are affected by the learning dynamics. The degree distribution itself is not informative about the structural evolution that the network undergoes during training, but it can still give useful insights into the structural differences between networks. We therefore chose to track other global properties, namely the mean degree, the mean geodesic distance and the number of connected components. Figure 4 displays these quantities, while Appendix C provides a detailed explanation of the degree distribution and how to approach its evaluation in this constrained-architecture setup.

Fig. 4

Contour plots of mean degree, mean geodesic distance and number of connected components for the case of Normal initialization without dropout. Note that while the iterative iDBN algorithm allows the entire network to be analyzed from the earliest learning stages, in the greedy case the upper layers remain untouched by the update rule until the lower layers are fully trained. This motivates the visualization choice of the right column: the dashed line represents the subdivision between layers, so as to display the trend of change in the whole network during the true learning time-span

Results suggest that the deep network undergoes a substantial transformation during unsupervised learning. A major difference lies in the evolution of the mean degree during training. Referring to Fig. 4a, non-trivial network properties tend to emerge in later stages of learning, especially for the greedy network. This suggests that the iterative implementation favors the development of complex circuits within the network connections and eventually the emergence of larger components. This applies both to the mean degree and to the number of connected components (recall that an isolated node is itself considered a component). The visualization of the mean geodesic distance shows that the evolution of this property is similar in the greedy and iterative cases; in particular, we can observe a phase transition from an initial state in which all nodes are isolated (for an isolated node the geodesic distance is zero) to the emergence of some components. Once the different components connect to each other and the connections strengthen, the mean geodesic distance decreases.

The second set of results, in Fig. 4b, displays the evolution of the same properties in the binomial replicas. The labels “binomial greedy” and “binomial iterative” indicate that the binomial counterparts are obtained using the probabilities of edge existence derived from the real greedy and iterative networks at the epochs considered. The main difference is that the greedy and iterative instances display the same overall behavior. Unsurprisingly, the visualizations of mean degree and connected components practically overlap. This expected result reflects the fact that no learning dynamics shapes the internal structure of these networks. In addition, the bottom panels (connected components) show that in random networks the formation of a single giant component is strongly encouraged. Note further that the mean degree and mean geodesic distance attain larger values in binomial networks. This is indeed what one would expect: random networks do not have hubs, i.e., nodes with a much larger degree than the vast majority of the other nodes, which are instead a characteristic of real scale-free networks. The absence of a few super-connected nodes implies a larger mean degree.

The results discussed above refer to the evolution of global network properties. As mentioned before, in network science it is well known that real networks exhibit a power-law degree distribution. As a further comparison between the real networks and their binomial counterparts, we thus also inspected the degree distribution in both the greedy and iterative cases. For this test, we analyze the networks at the end of training and we also include the results obtained with dropout, to see whether the sparsity induced by dropout changes the network structure. Figure 5 shows the degree distributions of real and binomial graphs for the case of Normal initialization, both with and without dropout. The effect of the architectural bias discussed above is particularly clear in the two leftmost upper and lower panels, despite the implementation of the modified degree distribution that accounts for the maximal potential connectivity of each node. This result suggests that the modified degree distribution should be further refined in order to model the distribution of node degrees more accurately; still, the degree distributions show qualitatively different shapes, suggesting that the learning dynamics in DBNs characterizes the derived network as a non-random graph. Interestingly, the effect of dropout is to make random replicas more structurally similar to real networks. The top-right and bottom-right panels of Fig. 5 show that a larger connectivity for a small number of nodes is tolerated also in binomial graphs.

Fig. 5

Degree distributions with the cutoff threshold set to 0.4. The networks analyzed for this figure have been initialized with the Normal distribution; the case of Glorot initialization does not display significant differences

Numerosity Data Set

As shown in Fig. 6a the psychometric function observed at the end of the unsupervised learning phase is well aligned with the results obtained by Stoianov and Zorzi using the greedy training scheme [14]. The authors reported a Weber fraction value of 0.15, while we obtained a value of 0.17.

Concerning the developmental trajectory, as shown in Fig. 6b the Weber fraction displays a marked decrease during the learning period, especially at the early stages of development. This trend is similar to that observed during human development [55, 56]. Note that our curve has been obtained by averaging 20 model runs, training 5 different classifiers for each data point to collect more reliable statistics; the average values thus account for a population of 100 data points for each sampled epoch. It is interesting to note that in the early stages the network may show a temporary worsening of performance, and hence highly variable values of \(w\); the behavior then stabilizes to an asymptotic value in more advanced learning stages.

Figure 6c and d display the same data points, along with a power-law fit obtained using the method proposed by Testolin et al. [31]. Due to the initial zero value of the epoch time stamp, the basic functional form is actually a modified power-law [21]:

$$\begin{aligned} y = a(1 + s\,x)^{b} \end{aligned}$$
(2)

The parameters a, b and s are fitted to the data points describing the progressive development of the Weber fraction. The resulting parameters are reported in Table 1: the fit closely follows the trajectory of the Weber fraction, thus describing the development of the number sense across the learning period of our networks satisfactorily well.
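A sketch of the fit of Eq. 2 with scipy; the initial parameter guesses are arbitrary and the exact fitting procedure of [31] may differ.

```python
import numpy as np
from scipy.optimize import curve_fit

def modified_power_law(x, a, b, s):
    """Eq. 2: y = a * (1 + s*x)**b, which is well defined at epoch x = 0."""
    return a * (1.0 + s * x) ** b

def fit_weber_trajectory(epochs, weber_fractions):
    """Fit the developmental trajectory of the Weber fraction over epochs."""
    params, _ = curve_fit(modified_power_law,
                          np.asarray(epochs, dtype=float),
                          np.asarray(weber_fractions, dtype=float),
                          p0=[0.5, -0.5, 1.0], maxfev=10000)
    return params  # a, b, s
```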

Table 1 Fitted parameters of Eq. 2 for the w values scaled according to the methods of S & Z [14] and TZM [31], respectively
Fig. 6

Final number acuity and developmental trajectories of the DBN. Panel 6a reports the psychometric curve obtained from the numerosity discrimination task at the end of unsupervised iDBN learning. The Weber fraction measures the steepness of the curve, here \(w = 0.17\). Panel 6b reports the trend of decay of the Weber fraction during unsupervised learning. The solid curve represents the average w values obtained in different simulation runs, while the shaded area represents the standard deviation. Panels 6c and 6d show the power-law fit according to Eq. 2. The dashed lines represent reference values for the final w from [14] and [31] respectively

Discussion

Our simulations demonstrate that deep belief networks can be trained iteratively, by jointly adjusting all the weights of the model hierarchy following the observation of each sensory pattern (or minibatch of patterns). This innovative learning algorithm can be used in place of the traditional greedy, layer-wise learning algorithm in order to accurately track the developmental trajectory of the model. This makes it possible to study how global properties of the network gradually emerge during the course of learning, while enabling a systematic comparison with biological developmental trajectories observed in empirical studies.

The proposed algorithm was first evaluated on the popular MNIST benchmark, where DBNs trained using the greedy algorithm have traditionally achieved very good performance. We showed that the DBN trained using our iterative algorithm was able to achieve a performance comparable to its greedy counterpart, both in terms of readout accuracy from the internal representations and in terms of reconstruction capabilities. We also probed the final models on a variety of generative tasks, which assessed the DBN’s ability to reproduce, complete and denoise corrupted input images. Also in this case, we did not observe significant differences between the greedy and iterative versions of the learning algorithm. On the contrary, the attempt to implement joint training of all weights by simply propagating signals across the full stack in both the bottom-up and top-down learning phases led to poor convergence and unsatisfactory results. This reveals that, after the feed-forward propagation across the entire DBN hierarchy (which resembles the fast feed-forward sweep observed in cortical circuits [42, 43]), the neurons’ activation at the deepest layer cannot be fed all the way back to the visible layer in a (symmetrical) fast feedback sweep, but needs to be locally processed at each layer to compute the learning signals (as implemented in the iDBN). Notably, the latter scheme is also consistent with local recurrent processing supported by recurrent and horizontal connections within cortical areas [42, 44].

To further support the cognitive plausibility of the proposed developmental scheme, we also implemented a straightforward interleaved training approach to tackle continual learning tasks, demonstrating that the iDBN can also be extended to challenging scenarios that require incrementally incorporating new knowledge into the deep network. This paves the way to the investigation of more plausible continual learning schemes that exploit the top-down generative properties of the DBN to sample the stimuli that are needed for interleaved training [57]. Moreover, we carried out an extensive analysis of the progressive development of structural properties in the DBN, by investigating the emergence of a variety of graph theoretical properties, such as degree, geodesic distance and connected components. Our iterative learning approach allowed us to highlight the gradual refinement of these properties at the global level, thus opening up the possibility of studying the topological development of such complex hierarchical systems more systematically.

We finally evaluated our iterative approach on a data set that has recently been exploited in a variety of cognitive models to simulate the perception of visual numerosity. Also in this case, the iDBN achieved a final accuracy comparable to that of its greedy counterpart, while allowing precise tracking of the progressive development of the internal representations of the model. Behavioral performance, supported by a linear readout from the internal representations, can also be continuously tracked to monitor skill acquisition. Indeed, the learning curves resulting from our approach closely resemble the learning curves reported in experimental studies with human children, and also overlap with those reported in a recent developmental model based on deep autoencoders.

Conclusion

The scope of the present work was twofold. On the one hand, we presented a novel unsupervised learning algorithm for deep belief networks that makes it possible to accurately track the progressive development of the internal representations of the model. We validated our algorithm on two prototypical benchmark domains, achieving results comparable to the state-of-the-art. On the other hand, we demonstrated that our iterative learning algorithm can be extended to more realistic learning scenarios, while supporting the psychometric analysis of progressive changes in behavioral performance and the study of the gradual development of network properties from the point of view of network theory.

Future studies might take advantage of the proposed iterative algorithm to investigate how the emergence of developmental disorders can be related to impairments in both the initial conditions of the system and the subsequent learning phases. This would be particularly relevant for the study of widespread learning disabilities such as dyscalculia, for which we still lack a computational characterization [30]. It would also be useful to explore whether the proposed iDBN approach could be applied in other cognitive domains involving unsupervised deep learning. This would be valuable not only from a machine learning standpoint, but also from a cognitive modeling perspective: indeed, our iterative algorithm makes it possible to create developmental models that can be quantitatively validated against empirical data collected on humans. Similarly, the proposed graph analysis could be applied in developmental physiology to better understand how structural and functional properties of self-organizing networks might gradually emerge from unsupervised learning dynamics.