Introduction

Network theory is a popular and powerful tool for modeling and explaining the emergent properties of various complex systems in many different fields, such as economics, social sciences, biology, transportation, neuroscience, and beyond1,2,3,4. Considerable efforts have been devoted to designing models capable of grasping and reproducing the properties of real networks as closely as possible. For example, a mechanistic approach can be followed, in which one specifies a set of microscopic rules (usually involving the creation/deletion of nodes and links) used to grow or evolve the network5,6. It is typically relatively simple to produce samples of networks from a particular mechanistic model. Although such a framework allows the incorporation of domain-specific knowledge and expertise to design network formation rules, mechanistic models are still burdened by the human understanding of the problem, meaning that while generative routines can be very rich and flexible, they are inherently limited and capable of synthesizing only some particular ensembles of networks7,8. A more general alternative is offered by probabilistic approaches, such as the so-called Exponential Random Graph Model9, where networks are generated according to a given probability measure. Network generation is tightly entangled with the related topic of network comparison. Indeed, only by introducing a specific similarity measure is it possible to say whether the artificial network generated by a model is akin to an observed real-world network. Many different options are available to compare two networks, from measuring the distance between adjacency matrices to more sophisticated metrics and observables, often based on information theory principles10,11,12,13,14. Some of the most common observables generally focus on local (e.g. degree distribution, local clustering) or global (e.g. the diameter, that is, the longest shortest path) properties.
Mechanistic generators try to replicate a subset of network properties but, unfortunately, a proper sufficient subset of observables that would ensure a realistic correspondence of the synthesized networks with the observed ones is still lacking. In this work, we propose using the intrinsic dimension (ID) of the network at different scales to compare simulated and real-world networks. The ID of a dataset was originally introduced for characterizing dynamical systems, but has now become a key concept in the realm of manifold learning. In fact, data points are typically embedded in high-dimensional spaces characterized by a large number of features, among which a certain degree of (possibly nonlinear) interdependence exists. For this reason, the data points are believed to belong to a lower dimensional manifold defined by a small number of independent variables. The number of such variables is the ID. Indeed, in most real-world datasets it is possible to describe the data, at least locally, using a number of coordinates which is much lower than the number of features15,16,17. Very interestingly, the estimated ID depends on the scale at which a given system or dataset is observed, reflecting the complexity of the data structure at different resolution levels. For example, data lying on a line and perturbed by d-dimensional noise have an ID of order d at a small scale, and of order one at a large scale. The motivating idea of this work is to exploit the rich information encoded in this scale dependence to characterize the properties of unweighted networks.

The concept of estimating the dimensionality of a network is not new; the first attempts to characterize the dimension of a network can be found in18,19 and were refined in successive studies20,21. This concept is partially related to fractality and self-similarity, properties that have been extensively explored in networks22,23 and that can currently be characterized by various mathematical tools24,25,26 often relying on instruments inspired by manifold learning techniques such as the Box-Counting27,28 or the Correlation Dimension29,30. However, to our knowledge, in all methods introduced so far the discreteness of the distances between the nodes was not taken explicitly into account. For example, renormalization methods for determining the fractal dimension simply cover the nodes without considering whether the edges are weighted or not, often assuming that the distance between nodes is a real number. If applied to unweighted networks, where the distances between vertices are computed as shortest-path lengths, these approaches may be affected by systematic errors. Moreover, and possibly more importantly, available approaches do not allow estimating the ID as an explicit function of the scale, a feature that is essential if one wants to use the ID as a network fingerprint. To overcome such a limitation, we propose the use of an ID estimator specifically built for data spaces in which the distances can only take discrete values31. Some examples of the ID as a function of the scale, which we name ID signature, estimated by our approach are reported in Fig. 1. The networks were obtained by first sampling points uniformly at random on a low-dimensional manifold, which is then embedded in a high-dimensional space by adding coordinates that are all zero; subsequently, all coordinates are perturbed with Gaussian noise. Finally, the network is created by connecting each point to a number of neighbors that is fixed, on average, to a given value.
In the blue and green networks, only the first coordinate is uniformly sampled over the interval (0,10) and 99 fictitious zero coordinates are added; the added Gaussian noise has standard deviations of, respectively, \(\varepsilon =10^{-4}\) and \(\varepsilon =5\cdot 10^{-3}\). The ID at a large scale is close to one, corresponding to the dimensionality of the underlying manifold emerging when noise becomes irrelevant; at a smaller scale, the ID is determined by the number of neighbors that are linked to create the network structure, namely, by the degree, and by the variance of the Gaussian noise. In the green network, in which the average degree is fixed to 10 and the variance of the noise is large, the ID is 5 at distance one, but it then becomes significantly larger, reaching the value of 1 only at a scale of \(\sim\) 10. For the blue network, which has the same average degree as the green one but a smaller noise, the ID is also equal to 5 at distance 1 but then quickly plateaus at 1. Finally, the orange network, for which both the first and second coordinates are uniformly sampled in the interval (0,10), has the average degree fixed at 6 and a small noise, so that the ID is \(\sim\) 3 at a small distance, and very quickly converges to two.
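The construction just described can be sketched in a few lines of NumPy. This is a minimal illustration of the recipe (uniform sampling on a low-dimensional manifold, zero-padding, Gaussian noise, Poisson-distributed nearest-neighbor linking); the function name and default parameters are ours, not the exact code used for Fig. 1:

```python
import numpy as np
from scipy.spatial.distance import cdist

def geometric_network(n_points=300, manifold_dim=1, ambient_dim=100,
                      noise=1e-4, mean_degree=10, seed=0):
    """Sample points uniform on (0, 10)^manifold_dim, zero-pad them to
    ambient_dim, perturb every coordinate with Gaussian noise, then link
    each point to a Poisson-distributed number of its nearest neighbours."""
    rng = np.random.default_rng(seed)
    X = np.zeros((n_points, ambient_dim))
    X[:, :manifold_dim] = rng.uniform(0.0, 10.0, size=(n_points, manifold_dim))
    X += rng.normal(0.0, noise, size=X.shape)
    D = cdist(X, X)                      # pairwise Euclidean distances
    np.fill_diagonal(D, np.inf)          # exclude self-links
    edges = set()
    for i in range(n_points):
        k = rng.poisson(mean_degree)     # random number of neighbours to link
        for j in np.argsort(D[i])[:k]:
            edges.add((min(i, int(j)), max(i, int(j))))
    return edges
```

The returned edge set defines an unweighted, undirected graph whose ID signature can then be estimated from shortest-path distances.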

Figure 1
figure 1

The ID signature of networks built upon points embedded in a metric space is consistent with the ID of the underlying manifold from which the points were extracted. Points are uniformly sampled in the interval (0,10) (blue and green curves) or in a square with a side length of 10 (orange curve), embedded in 100 dimensions by adding 99 or 98 zeros, and then Gaussian noise with standard deviation \(\varepsilon\) is added to all the coordinates. We then create the network by connecting each 100-dimensional point to a random number of neighbors extracted from a Poisson distribution with given intensity (i.e. mean value) \(\lambda\). \(\lambda\) is 10 for the green and blue networks and 6 for the orange one. While at high scales the ID signature reaches a plateau corresponding to the expected value (namely 1 or 2), at small distances the ID depends on the given combination of \(\varepsilon\) and \(\lambda\). In particular, the larger the noise, the greater the dimensionality estimated at small scales, as the low-dimensional structure becomes temporarily hidden by the nonnegligible noisy coordinates.

The rich behavior of the ID as a function of the scale observed in artificial networks that emerges from Fig. 1 prompted us to introduce an approach that allows inferring the parameters of a mechanistic model able to reproduce a target ID curve. This is achieved by wrapping a generative model within the framework of the Approximate Bayesian Computation32,33 (ABC). We will show that networks generated using ABC in such a way that their ID signature reproduces that of a target network are statistically similar to the target network according to degree distribution, closeness, betweenness, clustering coefficient and page rank, all observables that are not directly controlled for in ABC. We apply our algorithm to benchmark systems, comparing its performance on simple generative models, and show that ABC using the ID signature as summary statistics allows the estimation of the ground truth generative parameters. This occurs even for more sophisticated and flexible mechanistic models, with up to 8 parameters. For some classes of networks, we find that the ID signature alone, even if well reproduced, does not allow us to uniquely identify the original parameters with a sufficiently narrow posterior; it leads, instead, to a class of estimates, a manifold in the space of parameters, that nonetheless always includes the generative parameters associated with the observed network. We also find that, for other networks characterized by a large diameter, it is impossible to reproduce the reference ID signature with the generative models available in the literature that we tested. Therefore, we developed a new generative network algorithm in which the ID at different scales is dynamically controlled while the network is built. We will show that this approach allows the generation of ensembles of networks whose properties are statistically similar to real-world networks, even when standard generative models fail.

Methods

In this section, we present the three ingredients of our approach: the discrete intrinsic dimension estimator, the ABC algorithm and the ID-guided generative network model.

Estimating the ID of a graph

We infer the ID of a network from the probability distribution of the distances between the nodes, as measured by the number of links to go from one node to another following the shortest path. Since these distances, in an unweighted network, can only take discrete values, we use I3D31, an approach specifically developed for estimating the ID in datasets in which the features can only assume integer values. In this approach, one chooses two integers \(R_1\) and \(R_2>R_1\), and counts, for each data point i, the number of data points whose \(L^1\) distance from i is smaller than or equal to \(R_2\) (denoted by \(k_i\)), and the number of data points whose distance from i is smaller than or equal to \(R_1\) (denoted by \(n_i\)). In Refs.31,34 it is proven that, under some regularity assumptions, \(n_i\) is a realization of a binomial random variable where \(k_i\) is the number of trials, and the success probability p depends only on \(R_1\), \(R_2\), and the dimension d of the square lattice which contains the data: \(n_i | k_i\sim \textrm{Binomial}(k_i,p(d,R_1,R_2))\). An explicit formula for p can be derived from the theory of Ehrhart polynomials35,36. In particular, \(p(d,R_1,R_2)={V_\diamond (R_1,d)}/{V_\diamond (R_2,d)}\) where \(V_\diamond (R,d)\) is the number of points that would be observed within a distance R on a square lattice of dimension d:

$$\begin{aligned} V_\diamond (R,d)= \left( {\begin{array}{c}d+R\\ d\end{array}}\right) \;_2F_1(-d,-R,-d-R,-1). \end{aligned}$$
(1)

where \(_2F_1(\cdot ,\cdot ,\cdot ,\cdot )\) is the ordinary hypergeometric function. Extending the reasoning from a single reference point i to the whole dataset, and assuming independence among the N random variables \(n_i\), one can write the conditional probability of observing the vector of \(n_i\)-s given the values of \(k_i\)-s and given p:

$$\begin{aligned} \mathcal {L}=\prod _{i=1}^N\textrm{Binomial}(n_i|k_i,p(d,R_1,R_2)). \end{aligned}$$
(2)

where \(\textrm{Binomial}(n|k,p)\) denotes the probability mass function of a \(\textrm{Binomial}(k,p)\) random variable evaluated in n. One then estimates the ID by maximising \(\mathcal {L}\) with respect to d (or, alternatively, by taking the average of the posterior, which, assuming a Beta conjugate prior, is a Beta distribution31). This amounts to finding the root of

$$\begin{aligned} \frac{V_\diamond (R_1,d)}{V_\diamond (R_2,d)} - \frac{ { \langle n \rangle } }{ { \langle k \rangle } } = 0. \end{aligned}$$
(3)

where \({ \langle n \rangle }\) and \({ \langle k \rangle }\) are averages computed over all nodes in the network. By varying \(R_2\) and imposing \(R_1=r\;R_2\), where \(0<r<1\) is the only free parameter of the method, one obtains the value of the ID at different scales \(R_2\). In the rest of the work, we set \(r=0.5\).
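Equations (1) and (3) are straightforward to evaluate numerically. The following sketch (function names are ours; SciPy's hypergeometric function and a bracketed root search are used for illustration) solves Eq. (3) for d given the empirical averages \(\langle n \rangle\) and \(\langle k \rangle\):

```python
from scipy.optimize import brentq
from scipy.special import binom, hyp2f1

def lattice_volume(R, d):
    """Eq. (1): number of lattice points within L1 distance R in dimension d.
    scipy.special.binom extends to real d via the gamma function, so the
    expression can be evaluated at non-integer d during the root search."""
    return binom(d + R, d) * hyp2f1(-d, -R, -d - R, -1.0)

def estimate_id(n_mean, k_mean, R1, R2, d_max=200.0):
    """Solve Eq. (3): find d such that V(R1, d) / V(R2, d) = <n> / <k>."""
    f = lambda d: lattice_volume(R1, d) / lattice_volume(R2, d) - n_mean / k_mean
    return brentq(f, 0.01, d_max)
```

As a sanity check, \(V_\diamond(1,d)=2d+1\), so a 3-dimensional lattice contains 7 points within distance 1 and 25 within distance 2, and `estimate_id(7, 25, 1, 2)` recovers d = 3.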

The approach we just described allows estimating the ID using only distances, and can therefore be applied with no modification to estimate the ID of an unweighted network. The ID estimated in this manner can practically be thought of as the dimensionality of the lattice on which a subgraph centered on node i and including only the nodes up to a distance \(R_2\) could be (approximately) embedded. While this ID can in principle be different in different nodes, we here assume it to be constant. The goodness of the estimate is verified, for each \(R_2\), by model validation (see Supp. Inf.). In Supp. Inf. we also show that correlation effects, related to the values of \(n_i\)-s on different nodes of the same network, do not significantly affect the estimate of the ID obtained by maximizing the likelihood (2), which is written under the assumption of independence of the different counts.

Importantly, as shown in the examples in Fig. 2, the dimensionality of such an embedding lattice can be different at different scales. For instance, in the network representing yeast protein interactions, the ID at a small scale is of order 4, while, at a larger scale, it decreases rapidly to 0. In fact, at large scales, every system looks like a zero-dimensional point. In general, each network has a peculiar and specific ID signature, which, as we will see, can be considered as a fingerprint that allows identifying many properties of the network itself. Differently from other observables, which typically focus on local (e.g. degree distribution, local clustering) or global (e.g. diameter) properties, the ID signature spans short-, meso- and long-range scales.

When dealing with geometrical networks, namely, graphs built from points contained in a Riemannian manifold of a given dimension, the ID has a proper quantitative meaning. As the scale becomes larger, the ID tends to the dimension of the underlying manifold, as shown by the plateaus in Fig. 1. These plateaus identify an ID that is constant over a wide range of scales. However, one of the strengths of network theory is its ability to describe processes and systems that cannot be thought of as embedded in an underlying geometrical manifold, defined by explicit coordinates. In these cases, the ID curve is typically a complex function of the scale, exhibiting no clear plateau. As will be discussed in further detail, the ID curve is extremely sensitive even to tiny changes in the network topology. For example, it can change almost entirely upon the addition of a single extra link. The reason behind this sensitivity is the complex relationship between the edge structure and the distances between nodes: moving a single edge can create a link between two clusters of nodes that would otherwise be far apart, dramatically changing the vertex distance matrix, even if, for instance, the degree distribution is kept identical.

Our approach is based on the shortest path distances between nodes, meaning that self-loops or parallel edges (i.e., multiple edges connecting the same pair of vertices) would not change the ID estimation. The same algorithm can be applied to directed networks, as the only difference from undirected ones is that the distance matrix is not symmetric. However, to keep the examples simple and intuitive, we focus on simple (loop-less and single-edge) undirected networks.
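Putting the pieces together, the ID signature of an unweighted graph can be sketched as follows: for each node, neighbors within \(R_1\) and \(R_2\) are counted by truncated breadth-first search (including the node itself, at distance 0), and Eq. (3) is solved at each scale. This is a minimal sketch with networkx, not the validated estimator of Ref.31 (no model validation is performed, and function names are ours):

```python
import networkx as nx
from scipy.optimize import brentq
from scipy.special import binom, hyp2f1

def lattice_volume(R, d):
    # Eq. (1), valid for real d via the gamma-function form of the binomial
    return binom(d + R, d) * hyp2f1(-d, -R, -d - R, -1.0)

def id_signature(G, scales, r=0.5):
    """ID of graph G at each R2 in `scales`, with R1 = r * R2 (rounded)."""
    signature = {}
    for R2 in scales:
        R1 = max(1, round(r * R2))
        n_tot = k_tot = 0
        for node in G:
            # shortest-path distances from `node`, truncated at R2
            dist = nx.single_source_shortest_path_length(G, node, cutoff=R2)
            k_tot += len(dist)                                   # d(i,j) <= R2
            n_tot += sum(1 for d_ in dist.values() if d_ <= R1)  # d(i,j) <= R1
        f = lambda d: lattice_volume(R1, d) / lattice_volume(R2, d) - n_tot / k_tot
        signature[R2] = brentq(f, 0.01, 200.0)
    return signature
```

On a ring of 20 nodes, a discretized one-dimensional manifold, the estimated ID is 1 at every scale well below the diameter, as expected.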

Approximate Bayesian computation

To illustrate the usefulness of the ID signature as a fingerprint of a network, we demonstrate that it can be used as a summary statistics to estimate the parameters of a generative model capable of producing networks with properties that are very similar to those of a given target network. Among the available methods to estimate the parameters of the generative processes, we resorted to Sequential Monte Carlo Approximate Bayesian Computation37,38, a flexible and adaptable scheme that, with respect to other estimation methods, also provides uncertainty quantification. Approximate Bayesian Computation (ABC) has been successfully employed in many different fields32,33 and in the context of network theory39,40,41. The simplest ABC algorithm is the so-called Rejection ABC, which aims at inferring a posterior distribution \(f(\theta |x_0)\) for a set of parameters \(\theta\) given some observed data \(x_0\) when the likelihood of the model cannot be evaluated explicitly but is easy to sample from. Given a putative vector \(\theta ^*\) extracted from a prior distribution \(\pi (\theta )\), one generates a simulated network x from the model (encoded by a likelihood function f) conditioned on the sampled value of the parameter, \(f(x|\theta ^*)\). The proposed \(\theta ^*\) is accepted if the simulated data are “close” to the observed data. Such closeness can be assessed by computing the distance between the summary statistics S(x) and \(S(x_0)\). One thus approximates the posterior according to \(f(\theta |x_0)\sim f(\theta \,|\,\rho (S(x),S(x_0))<\varepsilon )\) where \(\rho\) is a suitable distance measure and \(\varepsilon\) is a tunable tolerance. In the limit of \(\varepsilon \rightarrow 0\), and if \(S(\cdot )\) is a sufficient summary statistics, the approximate posterior becomes exact. This procedure alone is, however, very inefficient, and many improvements have been proposed42.
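The rejection scheme can be illustrated on a toy problem (a sketch of ours, not the procedure used in this work): inferring the p of an Erdős–Rényi graph using the average degree as summary statistic. Since the edge count of G(N, p) is Binomial(N(N-1)/2, p), the graph itself need not even be built:

```python
import numpy as np

rng = np.random.default_rng(42)

def mean_degree(p, n=200):
    """Average degree of a sampled ER graph; the edge count is binomial."""
    edges = rng.binomial(n * (n - 1) // 2, p)
    return 2.0 * edges / n

p_true = 0.05
s_obs = mean_degree(p_true)                       # observed summary statistic

accepted = []
for _ in range(20000):
    p_star = rng.uniform(0.0, 0.1)                # draw from the prior
    if abs(mean_degree(p_star) - s_obs) < 0.5:    # tolerance epsilon
        accepted.append(p_star)

posterior_mean = float(np.mean(accepted))
```

With these settings a few percent of the proposals are accepted, and the posterior mean concentrates around the ground truth p = 0.05.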
As already mentioned, here we will exploit the Sequential Monte Carlo extension of ABC (see43 for details about its advantages and drawbacks). Briefly, within this framework, a population of parameters \(\theta _1,\dots ,\theta _n\) (termed particles) evolves from an initial prior distribution through a sequence of intermediate distributions that converge to the true posterior. Along this sequence, the tolerance \(\varepsilon\) is reduced at each step according to a tunable scheme, resulting in Algorithm 1:

figure a

where \(\rho _t(x_0)\) are the distances of the summary statistics of the accepted simulated networks from the reference summary statistics \(S(x_0)\) at generation t. The algorithm runs until \(\varepsilon _t > \varepsilon\), the “calibration sample” consists of accepting n particles extracted from the prior, and \(\varepsilon _1\) is obtained from this sample as the \(\text {median}(\rho _0(x_0))\). The proposals \(\theta ^{\text {prop}}\) at time t are extracted by means of a kernel density estimate, so that \(g_t(\theta |\theta _{t-1})\) is a multivariate Gaussian distribution with mean and covariance given by the mean and covariance of the particles accepted at step \(t-1\) (further technical details are provided in the SI).
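The structure of Algorithm 1 can be sketched on a toy problem, here inferring the mean of a Gaussian from its sample mean; all details (prior, proposal, summary statistic) are illustrative and not the network setting of this work:

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate(theta):
    # toy model: the summary statistic is the sample mean of 100 draws
    return rng.normal(theta, 1.0, size=100).mean()

s_obs, n_particles, eps_target = 2.0, 200, 0.05

# calibration sample: accept n particles directly from the prior unif[-10, 10]
particles = np.array([rng.uniform(-10, 10) for _ in range(n_particles)])
dists = np.array([abs(simulate(th) - s_obs) for th in particles])
eps = np.median(dists)          # epsilon_1 = median of calibration distances

while eps > eps_target:
    mu, sigma = particles.mean(), particles.std() + 1e-12
    new_particles, new_dists = [], []
    while len(new_particles) < n_particles:
        th = rng.normal(mu, sigma)          # Gaussian kernel proposal g_t
        d = abs(simulate(th) - s_obs)
        if d < eps:                         # accept if within current tolerance
            new_particles.append(th)
            new_dists.append(d)
    particles = np.array(new_particles)
    eps = np.median(new_dists)              # shrink the tolerance each generation
```

Each generation roughly halves the tolerance, so the population quickly concentrates around parameter values whose simulated summary matches the observed one.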

In principle, instead of resorting to a Bayesian approach, the estimation of the parameters of a generative model can be tackled by means of optimization methods, such as the Stochastic Path Integral44 or Pareto Simulated Annealing45. Identifying the most suitable one for the specific task of matching the ID signature of a generative model will be the object of future research. For the time being, we decided to stick to SMC-ABC since, firstly, it provides statistically rigorous uncertainty quantification, a feature which is often absent in optimization methods. Secondly, it allowed us to reliably explore all the relevant parameter space using moderate computational resources through a simple and straightforward integration of our model within the well documented pyABC library46.

ID-guided generative model

Our main idea consists of using the ID signature as a vectorial summary statistics for the ABC procedure, using the \(L^\infty\) metric between the reference (observed or ground truth) ID signature and the signature associated with the generated networks x:

$$\begin{aligned} \rho (S(x),S(x_0)) = \max _{R < \Delta (x)}|\text {ID}_R(x) - \text {ID}_R(x_0)| =: \mathcal {D}(x,x_0) \end{aligned}$$
(4)

where \(\Delta (x)\) is the diameter of the network and ID\(_R(x)\) is the ID of network x computed at \(R_2=R\).
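Treating a signature as a vector of ID values indexed by the scale, Eq. (4) is a one-liner. In this sketch we truncate the comparison at the smaller of the two diameters, an illustrative choice, since beyond a network's diameter its ID is undefined:

```python
import numpy as np

def signature_distance(sig_x, sig_x0):
    """Eq. (4): L-infinity distance between two ID signatures, compared up to
    the smaller of the two available ranges of scales."""
    R = min(len(sig_x), len(sig_x0))
    a = np.asarray(sig_x[:R], dtype=float)
    b = np.asarray(sig_x0[:R], dtype=float)
    return float(np.max(np.abs(a - b)))
```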

As shown in the “Results” section, the classical ABC procedure, with the ID as a summary statistics, works very well in many test cases, both in artificial settings where the target network is obtained through a generative algorithm and for real-world networks characterized by a low diameter. However, all the generative models available in the literature that we tested fail to reproduce the ID signatures of real-world networks with low IDs and high diameters.

In some cases, the standard generative processes fail, possibly due to highly nonlinear relationships among the links (see Results for specific examples). To address this problem, we designed a new network generative process that, during the construction, compares the current ID signature with the reference signature. In particular, for each move proposing the addition of an edge, we compute the ID signature, and the move is accepted following the Metropolis-Hastings algorithm, i.e., with a probability proportional to:

$$\begin{aligned} \min \Bigg (1,\exp \bigg [-\beta \big (\mathcal {D}(x',x_0)-\mathcal {D}(x,x_0)\big )\bigg ]\Bigg ) \end{aligned}$$
(5)

where \(\beta\) plays the role of the inverse of a temperature, \(x_0\) is the reference network, x is the last accepted network and \(x'\) is the network with the newly proposed edge.
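The acceptance step can be sketched as follows. The helper is our own illustration (the ID-signature computation is abstracted away into the two mismatch values), written with the sign convention that a proposed edge reducing the signature mismatch is always accepted, while a worse one survives with a probability that decays exponentially in \(\beta\):

```python
import math
import random

def accept_edge(d_new, d_old, beta, rng=random.Random(0)):
    """Metropolis-Hastings rule for the ID-guided generator: d_new and d_old
    are the signature mismatches D(x', x0) and D(x, x0) of the proposed and
    current networks. Returns (accepted?, acceptance probability)."""
    p = min(1.0, math.exp(-beta * (d_new - d_old)))
    return rng.random() < p, p
```

For example, with \(\beta = 10\), a move that worsens the mismatch from 0.2 to 0.5 is accepted with probability \(e^{-3} \approx 0.05\), while any mismatch-reducing move is accepted with probability 1.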

This procedure is similar to that proposed for the dk-models in47. In that paper, in order to generate a random graph that reproduces an observed one, the authors take into account the original average degree, the degree distribution, the joint degree distribution, the average clustering coefficient and the degree-wise average clustering coefficient, and perform edge swapping while targeting the aforementioned statistics. They show that, in several cases, by limiting the dk-series at the 2.5k level (which means taking into account local properties), also other network observables at the meso- and macroscopic scales are reasonably well matched. However, already in the supplementary information of the same paper, it is reported that this does not happen for the BRAIN network, in which betweenness and shortest path length are not reproduced. The same occurs for the US power grid network48, one of our test cases, even if the 3k statistics is also accounted for. We think this is not by chance. Indeed, such graphs are characterized by a particularly high diameter, a feature that, as we will show, cannot typically be obtained by available models.

In this work, we focus on the ID signature as our unique summary statistics, and let the network grow from scratch. At each step, the network growth process is guided by meaningful and interpretable edge-addition actions. This model, too, is integrated within a Sequential Monte Carlo Approximate Bayesian Computation framework to estimate the optimal \(\beta\) and the vector of probabilities associated with the edge-addition actions.

Results

ID signature of real-world networks

The first result we present concerns the staggering variety of ID signatures of real-world networks. In Fig. 2, we report five representative examples: social networks, a protein interaction network and infrastructure networks. The ID at distance 1 is directly related to the average degree. Indeed, let us consider Eq. (3): for \(R_1=0,\;R_2=1\) we get \(\frac{1}{2d+1} - \frac{1}{ { \langle k_1 \rangle } +1} = 0\), from which \(d= { \langle k_1 \rangle } /2\), where \(k_R\) indicates the distribution of neighbors at distance R, so that \(P(k_1)\) is the canonical degree distribution and \({ \langle k_1 \rangle }\) is thus the average degree. For larger R, the estimator considers the average number of neighbours up to that distance. The commonly observed (more or less pronounced) increasing trend of the ID at a small scale is due to the exponential growth of the number of neighbors of each vertex with the radius R. In particular, a sharp rise to high ID values in this regime is typically associated with the presence of hubs, nodes characterized by a high degree that foster high connectivity and enlarge the size of neighborhoods at small distances. Accordingly, for vertices connected to such hubs, the number of neighbors at distance 2 or 3 is exponentially larger than their degree. This is the case for the Google+ and Facebook companies networks, reported in Fig. 2, where the degrees are (approximately) power-law distributed and the IDs show a narrow peak, reaching values of 20 and 12, respectively. Conversely, in cases where the degree distribution is more uniform and hubs are absent, the ID signature in the mesoscale region is smoother and can present a quasi plateau across a wide range of distances. This occurs, for instance, in the yeast protein network and in the US power grid network, where the ID is, respectively, of order 4 for distances between 2 and 7 and of order 3 for distances between 7 and 17.
At scales comparable with the average path length, the ID curve typically peaks and then starts declining, since the growth of the neighbourhoods’ size becomes subpolynomial. The largest scale that is still meaningful coincides with the longest shortest path, which is called the diameter.
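The closed form at distance 1 derived above (Eq. (3) with \(R_1=0,\;R_2=1\)) reduces to half the average degree, which is easy to verify; e.g., in the complete graph \(K_5\) every node has degree 4, giving an ID of 2 at distance 1. A minimal sketch with networkx (the function name is ours):

```python
import networkx as nx

def id_at_distance_one(G):
    """Closed form from Eq. (3) with R1 = 0, R2 = 1: d = <k_1> / 2."""
    degrees = [deg for _, deg in G.degree()]
    return sum(degrees) / len(degrees) / 2.0
```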

Figure 2
figure 2

ID signatures for real-world networks. Such a variety of ID profiles makes this observable a good candidate for characterizing different graphs. From the top left, clockwise direction: Amazon recommendations, Google+, Facebook companies, yeast proteins, US power grid. All networks were downloaded from49 and some of their summary statistics can be found in Table 1.

Table 1 Comparison of networks in Fig. 2 according to some characteristics. N: number of nodes, E: number of edges, l: average shortest path, \(\Delta\): diameter (of the connected component), D: largest degree.

The networks were downloaded from49, where further references and details can also be found.

ID-guided ABC for the Erdős–Rényi generative model

To assess the goodness of the ID signature as a summary statistics, we first check whether it can be used to retrieve the parameters chosen to create a reference network via a simple model. All the simulations involving ABC were performed using the PyABC package46. We first consider the Erdős–Rényi50 (ER) model, one of the simplest and most common random-graph models. According to the ER model, each possible edge of a network with N vertices is independently added with probability p, implying an average number of edges \({ \langle E \rangle } = \left( {\begin{array}{c}N\\ 2\end{array}}\right) p\). For our first experiment we set \(N=300\), \(p=0.01\) (and thus \({ \langle E \rangle } =448.5\)) and extract the ID signature of a single ER graph realization. This is the reference summary statistics that we want to reproduce. In this first test we simply check whether it allows us to properly estimate the value of p used to generate the related network.

Figure 3
figure 3

The ID signature used as a summary statistics allows us to retrieve the ground truth parameters used to generate the reference network for different generative mechanistic models. In each row, the left panels show the evolution of the ID through the SMC generations, while on the right we display the successive posteriors associated with the accepted particles. The evolution of \(\varepsilon _t\) is reported only in the first row, as the same behaviour is observed in all other cases. For all models, the target tolerance threshold is \(\varepsilon =0.05\), and the number of particles is \(n=50\). The first row shows the Erdős–Rényi model (depending on 1 parameter p), with the following simulation details (with reference to Algorithm 1): number of nodes \(N=300,\;p=0.01\), prior distribution \(\pi (p)=\text {unif}[0,0.025]\). The second row displays the Planted Partition model (2 parameters, see text for model details), where we set the number of communities \(l=10\), the number of nodes per community \(k=30\), \(p_{in}=0.25\) and \(p_{out}=0.025\). The third row is a realization of the ABNG51 model with 6 parameters. The prior is given by a uniform distribution on the simplex defined by \(\sum _i^6p_i=1\).

The results are reported in the first row of Fig. 3. The panel on the left shows the evolution of the average ID (with its standard deviation) throughout successive ABC generations; the central panel reports the successive posterior approximations of p; the panel on the right displays how convergence to the target \(\varepsilon\) occurs in 10 generations, following exponential decay. In particular, we observe that the average ID of the sampled networks is far from the target for the first 3 generations, and, accordingly, the associated posterior distribution is still broad. As the distance between the reference ID signature and that of the sampled networks decreases, the posterior sharpens around the ground truth value \(p=0.01\). For readability reasons, we report the evolution of the average ID and the posterior up to \(t=5\), as successive generations are practically indistinguishable in the plots, and would make the figure less clear.

ID-guided ABC for artificial networks

The results obtained for the ER model were expected, as the average number of links \({ \langle E \rangle }\) or, equivalently, the average degree, which are sufficient statistics for this model, are encoded in the ID signature at distance \(R=1\). Next, we considered less trivial generative models: the Non-Linear Preferential Attachment52 (NLPA), the Watts–Strogatz53 (WS), and the planted partition54 (PP). For such models, no trivial sufficient summary statistics are known. Here, we show that ABC, with the ID signature as the unique summary statistics, returns samples from the posterior that are centered around the parameters chosen as the ground truth. In the second row of Fig. 3 we report the results obtained for a PP graph, a model meant to build interacting communities. Once the number of communities and the number of elements per community are fixed, two more parameters, \(p_{in}\) and \(p_{out}\), regulate the probability of connecting vertices within a community and among communities. We followed the same Algorithm 1, appropriately accounting for the higher dimensionality of the parameter space (equal to 2 in this case). The convergence of \(\varepsilon _t\) to the target \(\varepsilon\) is exponential, similarly to the ER case, and thus not reported. Differently from the previous example, the average ID signature is apparently close to the reference one already from the first generation (left panel). However, the wide standard deviation implies a diversified population of graphs and, accordingly, a broadened posterior distribution. The narrowing of the standard deviation is then paired with the concentration of the posteriors around the ground truth parameters.
The fact that, to achieve reasonable precision on the posterior, one needs the whole population of sampled graphs to have similar ID signatures hints at the meaningfulness and power of such summary statistics, which is sensitive even to small variations of the generative parameters. The results for the other mentioned models are qualitatively similar and are thus reported in the SI.

To more closely mimic the generative mechanism behind real-world networks, we then considered more flexible and rich generative models. Many methods have been proposed throughout the years, including the dk-random graphs47, Chung-Lu55,56 and the Exponential Random Graph Models (ERGM)9. We here use the so-called Action Based Network Generators (ABNG)51, a mechanistic model that proposes the addition of edges according to intuitive and interpretable actions, which are based on well-known network properties. At each iteration of the generative process, a vertex is randomly chosen and a new link toward another node is added according to the probability associated with different possible actions. For instance, one of these actions creates a triadic closure; another adds the edge based on the degree of the target or on the degree of the target’s neighbors. For different combinations of the probabilities associated with the actions, the model gives rise to networks with radically different structures. In Ref.51 the authors show the flexibility of this model by reproducing the most common random graphs and a wide class of real-world networks. In particular, they estimate the probabilities of actions through Pareto Simulated Annealing45, fitting the following summary statistics: degree distribution, page rank, betweenness centrality, and clustering coefficient. In contrast, as already stated, we calibrate this generative model using ABC with the ID signature as the only summary statistic.
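The action-based growth loop can be illustrated as follows. This is a hypothetical simplification of ABNG, not the published algorithm: it implements only three actions (uniform random target, triadic closure, and a degree-proportional choice), with the action weights playing the role of the parameters to be calibrated.

```python
import random

def abng_like_growth(n, weights, n_edges, rng):
    """Action-based growth sketch (simplified, inspired by ABNG):
    weights = probabilities of the (random, triadic, degree) actions
    used to pick the target of each new edge."""
    adj = {v: set() for v in range(n)}
    actions = ("random", "triadic", "degree")
    edges = 0
    while edges < n_edges:
        u = rng.randrange(n)
        action = rng.choices(actions, weights=weights)[0]
        v = None
        if action == "triadic" and adj[u]:
            # triadic closure: link u to a neighbour of one of its neighbours
            w = rng.choice(sorted(adj[u]))
            candidates = [x for x in adj[w] if x != u and x not in adj[u]]
            if candidates:
                v = rng.choice(candidates)
        elif action == "degree":
            # preferential-attachment-like: target picked proportionally to degree + 1
            pool = [x for x in adj if x != u and x not in adj[u]]
            if pool:
                v = rng.choices(pool, weights=[len(adj[x]) + 1 for x in pool])[0]
        if v is None:
            # fallback (and the "random" action): uniform non-neighbour
            pool = [x for x in adj if x != u and x not in adj[u]]
            if not pool:
                continue
            v = rng.choice(pool)
        adj[u].add(v)
        adj[v].add(u)
        edges += 1
    return adj

g = abng_like_growth(40, weights=(0.2, 0.5, 0.3), n_edges=80, rng=random.Random(1))
```

Different weight vectors yield structurally different graphs, which is exactly the parameter space that the ABC procedure explores.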

As a preliminary step, we verified that, given certain ground truth probabilities and a reference network realization, we can reproduce its ID and correctly estimate the original parameters. In the bottom row of Fig. 3, we report the results obtained on artificial networks generated using 6 different actions. As in previous cases, the reference ID is perfectly retrieved and the ground truth parameters lie within the confidence level of the posteriors.

Note that, in this case, we are trying to infer 6 parameters, meaning that at least 6 constraints need to be enforced. For this reason, it is important that the number of scales at which the ID can be computed is larger than the number of parameters one wants to infer. Moreover, since the IDs computed using Eq. (3) exploit cumulative neighborhoods, the information extracted at a given distance r is also used in the estimations for \(R>r\). As a consequence, in principle, one wants the diameter of the network to exceed (possibly by far) the number of parameters to be estimated.
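The role of the diameter in bounding the number of usable scales can be made concrete: cumulative balls stop growing once the radius reaches the diameter, so no new information is available beyond that scale. A minimal sketch (the graph helpers are illustrative, not taken from the paper's code):

```python
from collections import deque

def bfs_distances(adj, source):
    """Shortest-path distances from `source` to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def cumulative_ball_sizes(adj, source, rmax):
    """|B(source, r)| for r = 0..rmax: each radius reuses all smaller balls."""
    dist = bfs_distances(adj, source)
    return [sum(1 for d in dist.values() if d <= r) for r in range(rmax + 1)]

def diameter(adj):
    """Largest shortest path; it bounds how many distinct scales the ID can probe."""
    return max(max(bfs_distances(adj, s).values()) for s in adj)

# a 12-cycle: diameter 6, so at most 6 informative scales
ring = {i: {(i - 1) % 12, (i + 1) % 12} for i in range(12)}
# cumulative_ball_sizes(ring, 0, 3) -> [1, 3, 5, 7]; diameter(ring) -> 6
```

Beyond \(r=6\) every ball in the 12-cycle already contains all 12 vertices, so the signature is flat there and provides no additional constraints.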

ID-guided ABC fails to fit the ID of high-diameter networks

As a final step, we then moved to real-world networks, where a ground truth generative model is not available. We start from a subset of the Facebook network, obtained by selecting a random vertex and its 500 nearest neighbors, which are linked if the corresponding edge is also present in the original graph. The ID of the subset network resembles the one computed on the whole network: a sharp increase in the ID at small scales and a small diameter, of order \(\sim 10\). The results of the ABC protocol are shown in the first row of Fig. 4 (for clarity and readability, we present only the last generation of the ABC routine). One can appreciate how the ID (in blue) is fairly well reproduced but does not reach the target \(\varepsilon =0.05\). To assess the quality of the ensemble of networks produced in the last generation of the ABC procedure, we extract 6 different well-known network properties: degree, betweenness, page rank, clustering coefficient, closeness, and eigenvector centrality. These observables are computed on each vertex, defining a set of 6 distributions that is then compared to the corresponding distributions of the reference network by means of the Kolmogorov–Smirnov (KS) statistics. The mean of the KS distances for the ensemble of networks of the last generation is reported in the panel on the right. The same plot with error bars is shown in the SI. The picture suggests that while some of the other observable distributions are close to the reference one (KS distance \(\lesssim 0.2\)), others are not (KS distance \(\gtrsim 0.2\)). However, we observed in previous cases that, to properly estimate the generative parameters, the average ID of the ensemble of generated networks has to be very close to the reference ID signature, together with a small variance. It is possible that, by getting closer to the reference ID, the agreement on the other properties might improve as well.
We will return to this example in the next section, where we exploit our new generative model.
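The KS comparison above reduces, for each property, to the two-sample KS statistic between node-level distributions. The paper presumably uses standard tooling for the 6 centralities; the stdlib-only sketch below shows the core of the comparison, with the degree distribution as the example property.

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum vertical gap
    between the two empirical CDFs (tie values are handled by advancing
    both samples past each distinct value before measuring the gap)."""
    a, b = sorted(sample_a), sorted(sample_b)
    na, nb = len(a), len(b)
    i = j = 0
    gap = 0.0
    while i < na and j < nb:
        x = min(a[i], b[j])
        while i < na and a[i] == x:
            i += 1
        while j < nb and b[j] == x:
            j += 1
        gap = max(gap, abs(i / na - j / nb))
    return gap

def degree_ks(adj_a, adj_b):
    """KS distance between the degree distributions of two graphs
    (dicts mapping vertices to neighbour sets)."""
    return ks_statistic([len(nb) for nb in adj_a.values()],
                        [len(nb) for nb in adj_b.values()])

# ring: all degrees equal 2; star: one hub of degree 5, five leaves of degree 1
ring = {i: {(i - 1) % 6, (i + 1) % 6} for i in range(6)}
star = {0: {1, 2, 3, 4, 5}, **{i: {0} for i in range(1, 6)}}
```

Identical distributions give a KS distance of 0, while structurally different graphs such as the ring and the star above give a large one; the thresholds \(\lesssim 0.2\) and \(\gtrsim 0.2\) quoted in the text sit between these extremes.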

Figure 4

Performance of ABNG against ID-guided ABC and ID-guided GM for two real-world networks and one geometric network. The panels in the left column compare the different ID signatures, while the panels on the right represent the average KS distances between the ensemble of sampled networks and the reference network for 6 typical graph properties. The first row concerns a subset of the Facebook company network. In this case, the ABNG model reproduces the ID quite faithfully even though it was fitted on observables other than the ID. The ID-guided ABC gets closer to the ID but does not reach the target \(\varepsilon =0.05\). In contrast, the ID-guided GM allows the reference ID to be matched perfectly. Accordingly, the KS distances also typically improve, from slightly to dramatically (as in the case of the closeness), even though they are not fitted explicitly. The second and third rows represent the same scenario for a 2-dimensional network and the US Power Grid. In these cases, both ABNG and ID-guided ABC fail to even approach the target ID, so that the ID-guided GM markedly outperforms the other models. As observed in previous scenarios, the KS distances tend to improve, especially in the 2-dimensional case. The green values of the ID and KS were obtained using the ABNG model by explicitly fitting 4 of the 6 properties shown in the KS distance plot (see main text). The error bars on the KS distances are shown in the SI.

Next, we consider the US Power Grid. To begin with, we look at the ID signature associated with the graphs generated using the optimized parameters reported in Ref.51. The average over 50 different network realizations (green curve and green circles in the bottom panel of Fig. 4, label: ABNG) shows that the resulting ID is substantially different from the observed one (black dashed line, label: Observed). In particular, the diameter turns out to be much smaller than expected. This is not surprising, as the suggested parameters were found through an optimization that fits local observables, the same ones that enter as actions in the ABNG model. Accordingly, the meso- and global-scale structures are not properly reproduced. In fact, examining the associated average KS distances shows that the quantities used as targets for the fit (betweenness, page rank, clustering coefficient, and degree) display very low values (\(\lesssim 0.1\)), despite the ID curves being very different. Conversely, the eigenvector and closeness centrality measures present very high KS statistics.

The blue curve in the bottom left panel of Fig. 4 (label: ID-guided ABC) is the average ensemble ID to which the ID-guided ABC converged after 10 generations, and it represents the closest ID signature that we could reach with this procedure. Even if we manage to find an ID that is considerably closer to the target, the result is again far from satisfactory. Indeed, the obtained diameter is still too low, meaning that the simulated networks are too compact. At the same time, local observables have not appreciably worsened (with the exception of the local clustering). This behavior has its origin in the actions used to grow the network. Those are, in fact, based on intuitive mechanisms that leverage local and neighborhood properties. None of them is explicitly built to enlarge the graph’s diameter or enforce global properties. This is a paradigmatic example of the limitations of even the most advanced generative models: their high flexibility allows sampling a wide ensemble of different networks that reproduce certain network properties. However, other properties, especially those of large-diameter networks, are practically impossible to obtain. This observation prompted us to attempt a different generative mechanistic model.

Matching the ID curve during the network generation

The reason behind the failure of ID-guided ABC is that the distance matrix, on which ID estimations are based, is a very complicated function of the edges: the simple addition of a single edge can dramatically change it. As a consequence, building intuitive and understandable actions that add edges without compromising the global network structure is far from trivial. For instance, it is very difficult to avoid creating bridges/shortcuts between distant vertices, whose addition dramatically shortens all distances. To address this problem, we exploit a generative process in which the ID curve is built dynamically, accepting only those moves that bring the ID of the network under construction closer to the observed one (see section “Methods”). To assess the validity of our methodology, we started by applying our ID-guided generative model (GM) to the subset of the Facebook company network presented in the previous section. According to the orange curve and KS distances in the first row of Fig. 4, the ID is now within the desired threshold of \(\varepsilon =0.05\) and the associated KS distances are typically comparable or lower, with a neat improvement especially for the closeness, whose mean decreases from 0.3 to 0.05. The worst reproduced property is the local clustering, with a median KS distance of 0.2.
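The accept/reject logic of such an ID-guided generation can be sketched as follows. This is a greedy toy version under stated assumptions: the starting seed (a ring), the uniform edge proposals, the `signature` growth-rate proxy, and the non-worsening acceptance rule are all illustrative simplifications of the Metropolis-like scheme described in the Methods.

```python
import math
import random

def ball_sizes(adj, source, rmax):
    """Cumulative neighbourhood sizes |B(source, r)|, r = 1..rmax, via BFS."""
    dist = {source: 0}
    frontier = [source]
    sizes = []
    for r in range(1, rmax + 1):
        nxt = []
        for u in frontier:
            for w in adj[u]:
                if w not in dist:
                    dist[w] = r
                    nxt.append(w)
        frontier = nxt
        sizes.append(len(dist))
    return sizes

def signature(adj, rmax=4):
    """Growth-rate proxy for the scale-dependent ID (illustrative only)."""
    n = len(adj)
    avg = [0.0] * rmax
    for s in adj:
        for i, size in enumerate(ball_sizes(adj, s, rmax)):
            avg[i] += size / n
    return [math.log(avg[r - 1] / avg[r - 2]) / math.log(r / (r - 1))
            if avg[r - 1] > avg[r - 2] > 0 else 0.0
            for r in range(2, rmax + 1)]

def id_guided_growth(n, target_sig, n_edges, rng, rmax=4):
    """Greedy sketch of ID-guided generation: start from a ring (connected,
    large diameter) and accept a proposed edge only if it does not move the
    running signature away from the target."""
    adj = {v: {(v - 1) % n, (v + 1) % n} for v in range(n)}
    edges = n
    best = max(abs(a - b) for a, b in zip(signature(adj, rmax), target_sig))
    attempts = 0
    while edges < n_edges and attempts < 40 * n_edges:
        attempts += 1
        u, v = rng.randrange(n), rng.randrange(n)
        if u == v or v in adj[u]:
            continue
        adj[u].add(v)
        adj[v].add(u)
        new = max(abs(a - b) for a, b in zip(signature(adj, rmax), target_sig))
        if new <= best:
            best = new  # accept: the match did not worsen
            edges += 1
        else:
            adj[u].discard(v)  # reject: undo the move
            adj[v].discard(u)
    return adj, best

rng = random.Random(2)
reference = {v: {(v - 1) % 30, (v + 1) % 30} for v in range(30)}
for _ in range(15):  # reference graph: a 30-ring plus up to 15 random chords
    a, b = rng.randrange(30), rng.randrange(30)
    if a != b:
        reference[a].add(b)
        reference[b].add(a)
g, final_dist = id_guided_growth(30, signature(reference), 45, rng)
```

The key design point is that proposals shortening many distances at once (the bridges mentioned above) move the signature sharply away from the target and are therefore rejected, which a purely action-based generator has no mechanism to do.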

The next step consists of dealing with large-diameter networks. To this end, we start by analysing an artificial network built on points embedded in a metric space, in the same fashion as those presented in the Introduction (see section “Introduction” and Fig. 1). Once the reference network was created, we applied the ABNG algorithm, ID-guided ABC (again, using ABNG as the generative model), and the ID-guided GM approach. The results are shown in the second row of Fig. 4. As in the previous example of the US power grid, neither ABNG nor ID-guided ABC provides a combination of parameters that satisfactorily reproduces the ID signature (green and blue curves). Conversely, the ID-guided GM matches the reference ID within just 3 generations (orange curve). The set of typical network observables is fairly well matched (KS distance \(\lesssim 0.2\)), apart from the clustering coefficient (KS dist \(\sim 0.3\)) and the eigenvector centrality (KS dist \(\sim 0.5\)). However, the discrepancy for the latter property is not completely unexpected, as this observable can display very different distributions (and thus a large KS distance) even for networks produced from the same generative model with the same parameters. See the SI for a discussion and some examples.
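A network of the kind used as reference here can be generated with a standard random geometric graph construction. The sketch below (parameters chosen for illustration, not those of the paper's experiment) samples points uniformly in the unit square and links pairs closer than a cutoff radius, yielding the large-diameter, locally 2-dimensional structure discussed in the text.

```python
import math
import random

def geometric_graph(n, radius, rng):
    """2D random geometric graph: n points uniform in the unit square,
    linked whenever their Euclidean distance falls below `radius`."""
    points = [(rng.random(), rng.random()) for _ in range(n)]
    adj = {v: set() for v in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if math.dist(points[u], points[v]) < radius:
                adj[u].add(v)
                adj[v].add(u)
    return adj, points

g, pts = geometric_graph(200, 0.12, random.Random(3))
```

Because every edge respects the underlying metric, shortest-path distances grow with Euclidean distances, and the graph's diameter scales roughly like the inverse of the cutoff radius.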

We finally applied the ID-guided GM to the US power grid network. Just three generations of ABC sampling were sufficient to reach an optimal agreement between the observed ID signature and the signature associated with the simulated networks, making the whole experiment less demanding than the one employing the pure ID-guided ABC described previously. In particular, for the ID-guided GM, the moves involving the addition of edges have an average acceptance rate of order 0.2, while the fractions of accepted graphs were in the range 0.17–0.3. Notably, for the last generation of the ID-guided ABC, the fraction of accepted graphs is of order \(10^{-3}\), if not lower. The results are reported in orange in the third row of Fig. 4. The posteriors are quite stable along generations, as the target ID signature is reached gradually during the network generation. Still, if only random moves were used to generate the network, one would need on average 150k Metropolis–Hastings steps to achieve the target number of edges. Conversely, if the links are added according to specific interpretable rules (similar to those provided by ABNG), as occurs in our algorithm, only 30k steps are needed. This means that providing structured rules for adding the links is still meaningful for increasing the acceptance rate. Very interestingly, apart from the eigenvector centrality, all other measures taken into account display relatively low KS distances from the observed ones. This means that by enforcing the ID signature, one is effectively imposing restraints that meaningfully affect the local structure, which is fairly well reproduced.

Discussion and conclusions

In this paper, we present a procedure to compute the intrinsic dimension (ID) of unweighted networks by leveraging information from the network at different scales. We then employ the ID as a summary statistic in an Approximate Bayesian Computation (ABC) framework to fit mechanistic generative models, enabling both parameter estimation and uncertainty quantification. The method can be readily extended to weighted networks, the only modification being that the ID should then be estimated with approaches that assume real-valued distances.

We first noted that the ID signature is a powerful observable that allows for the calibration of model parameters. However, we observed that the most advanced and flexible models available in the literature could not correctly reproduce real-world networks’ ID signatures, especially when their diameter is large. To address this issue, we developed a new generative process that dynamically controls the ID signature and allows for the satisfactory reproduction of the ID in real-world networks.

As a byproduct, we observed that artificial networks generated through our algorithm show a local structure that is qualitatively comparable with the observed one. Indeed, the probability distributions of local observables, such as the degree, obtained on artificial networks are very similar to those observed on the target network. Remarkably, this consistency is achieved even though these local observables are not explicitly included in the summary statistics used as a target in the ABC approach. In future works, it would be interesting to compare these results with those obtained for networks sampled from statistical models such as ERGM, where the ID signatures at each radius are used as summary statistics to find the vector of model parameters defining the ensemble from which networks are then sampled.

In fact, the choice of which statistics to consider ultimately depends on whether the analysis aims to focus on small-scale details or to understand the overall properties of the network. In this context, we believe the ID signature and the dk-statistics47 can be effectively used together to achieve a more accurate and comprehensive network representation. The dk model’s ability to accurately replicate local properties complements the ID signature’s strength in capturing the meso and global structure.

A possible application of the algorithm could be network repair/correction: the ID signature would provide a good target for regenerating broken connections without changing the large-scale connectivity of a graph that is only partially known. To support this claim, we ran a simple exploratory experiment. We compared the ID signatures, averaged over 50 copies of the analysed graphs, after a given fraction \(\alpha\) of randomly chosen edges had been removed or added. The results are shown in Fig. 5. The FB-Companies network shows a variation of the ID signature that depends smoothly on \(\alpha\) and is basically “symmetric” under edge addition or removal. This means that, in this network, adding or removing edges does not significantly change the ID signature. Conversely, the ID signatures of the 2-dimensional geometric network and of the US power grid respond to link addition and removal in very different ways. Link removal results in very minor variations of the ID signature, as in the case of the FB-Companies network. Instead, the addition of random links significantly changes the ID signature. We conclude that the ID signature is generally robust under edge removal, even in networks with a large diameter. Additionally, it can detect sudden topological changes when a small number of critical “wrong” edges is added. We plan to further explore the usage of the ID signature as a topological feature in link prediction57.
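The perturbation used in this experiment is simple to reproduce. The sketch below (an illustrative helper, not the paper's code) removes or adds a fraction \(\alpha\) of edges uniformly at random on a graph stored as neighbour sets; the perturbed copies are then the inputs whose ID signatures would be compared.

```python
import random

def perturb_edges(adj, alpha, mode, rng):
    """Return a copy of the graph with a fraction `alpha` of edges removed
    ('remove') or the same number of random non-edges added ('add')."""
    edges = sorted({(min(u, v), max(u, v)) for u in adj for v in adj[u]})
    k = int(alpha * len(edges))
    new = {v: set(nb) for v, nb in adj.items()}
    if mode == "remove":
        for u, v in rng.sample(edges, k):
            new[u].discard(v)
            new[v].discard(u)
    else:
        nodes = list(adj)
        added = 0
        while added < k:  # draw random pairs until k new edges are placed
            u, v = rng.sample(nodes, 2)
            if v not in new[u]:
                new[u].add(v)
                new[v].add(u)
                added += 1
    return new

# example: a 20-ring (20 edges) perturbed by alpha = 0.1, i.e. 2 edges
ring = {i: {(i - 1) % 20, (i + 1) % 20} for i in range(20)}
fewer = perturb_edges(ring, 0.1, "remove", random.Random(4))
more = perturb_edges(ring, 0.1, "add", random.Random(4))
```

On a ring, the two added chords act exactly as the critical “wrong” shortcuts discussed above: they sharply reduce distances, whereas removing two edges leaves most shortest paths intact.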

Figure 5

The ID signature is robust under edge removal and sensitive to edge addition. The panels show the ID signatures for ensembles of 50 networks obtained from the original one by the removal or addition of a fraction \(\alpha\) of edges.

Furthermore, our ID estimator could facilitate the task of network representation. The maximum value of the ID could be used as an ansatz for the dimension in which the network should be represented. Moreover, the ID curve can serve as a benchmark to check whether the network representation is consistent with the original network. Similarly, even the complicated task of network lattice embedding58 can exploit the ID signature as a solid starting point.