1 Introduction

Complex systems can often be modeled as dynamic graphs. In these systems, patterns of interactions evolve in time, determining emergent properties, associated function, robustness, and security of the system. There are several broad questions whose answers shed light on the evolution of such dynamic networks: (i) how many bits are required to best describe such a network and its structure (i.e., unlabeled underlying graph); (ii) how to infer underlying dynamic processes governing network evolution; (iii) how to infer information about previous states of the network; and (iv) how to predict the forward evolution of the network state. In this paper we deal with the first question (i.e., labeled and unlabeled graph compression).

To better understand the evolution of network structural properties, several probabilistic models have been proposed, including, e.g., the preferential attachment, duplication-divergence, Cooper-Frieze, and fit-get richer models [2, 6, 10, 24].

Clearly, some models are more suitable to certain types of data than others. For example, it has been claimed that the preferential attachment mechanism [2] plays a strong role in the formation of citation networks [23]. However, due to the high power law exponent of their degree sequence (greater than 2) and lack of community structure [6], preferential attachment graphs are not likely to describe well biological networks such as protein interaction networks or gene regulatory networks [19]. For such networks another model, known as the vertex-copying model, or simply the duplication model, has been claimed as a better fit [25]. In the vertex-copying model, one picks an existing vertex and inserts its clone, possibly with some random modifications, depending on the exact variation of the model [6, 14, 20]. Experimental results show that these variations on the duplication model better capture salient features of protein interaction networks than does the preferential attachment model [22].

In this paper we present comprehensive information-theoretic results for the full duplication model in which every new vertex is a copy of some older vertex. We establish precisely (that is, within an o(1) additive error) the entropy for both unlabeled and labeled graphs generated by this model and design asymptotically optimal compression algorithms that match the entropies up to a constant additive term. Interestingly, we shall see that the entropy of labeled graphs is \(H(G_n) = \Theta (n)\), while the structural entropy (the entropy of the isomorphism class of a random graph from the model, denoted by \(S(G_n)\)) is significantly smaller: \(H(S(G_n)) = \Theta (\log n)\). Thus, the vast majority of the information in labeled graphs from this model is present in the labeling itself, not in the underlying graph structure. In contrast, the entropy of labeled graphs generated by, e.g., the preferential attachment model is \(\Theta (n \log n)\) [17].

Clearly, given its simplicity, this model should be regarded as a stepping stone toward a better understanding of more advanced models of this type. The extensions are typically defined by a fixed-probability mix of the full duplication model and other rules, such as no-duplication or uniform attachment. We shall deal with such models in a forthcoming paper.

Graph compression has enjoyed a surge in popularity in recent years, as the recent survey [3] shows. However, rigorous information-theoretic results are still lacking, with a few notable exceptions. The rigorous information-theoretic analysis of graph compression (particularly in the unlabeled case) was initiated by Choi and Szpankowski [5], who analyzed structural compression of Erdős-Rényi graphs (see also [1]). The authors of [5] presented a compression algorithm that provably achieves asymptotically the first two terms of the structural entropy. In Łuczak et al. [17] the authors precisely analyzed the labeled and structural entropies and gave asymptotically optimal compression algorithms for preferential attachment graphs. There has been recent work on universal compression schemes, including in a distributed scenario, by Delgosha and Anantharam [8, 9]. Additionally, several works deal with compression of trees [11, 12, 18, 26].

The full duplication model has been analyzed almost exclusively in the context of typical properties such as the degree distribution [6]. It was shown that the average degree depends strongly on the initial conditions [16]. It was also proved that the asymptotic degree distribution fails to converge, yet it exhibits power-law behavior with an exponent dependent on the lowest nonzero degree in the initial graph [21]. Other parameters studied in the context of duplication models are the number of small cliques [13] and degree-degree correlations [4]. To the best of our knowledge, the entropy and compression of duplication models have not been discussed previously in the literature.

The rest of the paper is organized as follows: In Sect. 2 we define the full duplication model and present its basic properties. In Sect. 3 we establish the main results concerning the entropy of the unlabeled and labeled graphs, with Sect. 4 devoted to the construction of algorithms that achieve these bounds within a constant additive term.

2 Full Duplication Model

In this section we define the full duplication model and present some of its properties.

2.1 Definitions

The full duplication model is defined as follows: let us denote by \(G_0\) a given graph on \(n_0\) vertices for some fixed constant \(n_0\). Then, for any \(1 \le i \le n\) we obtain \(G_{i}\) from \(G_{i-1}\) by choosing one of the vertices of \(G_{i-1}\) (denoted by v) uniformly at random, attaching to the graph a new vertex \(v_i\) and adding edges between \(v_i\) and all vertices adjacent to v. Note that v and \(v_i\) are not connected – although if one wants to achieve higher clustering, the results in this paper can be straightforwardly applied to the model in which we add not only edges between \(v_i\) and the neighbors of v, but also between \(v_i\) and v. Observe that \(G_n\) has \(n + n_0\) vertices. Also, properties of \(G_n\) heavily depend on \(G_0\) and its structure, which we assume to be fixed.

Throughout this paper, we will refer to the vertices of the starting graph \(G_0\) as \(\{u_1, \ldots , u_{n_0}\}\) and to all other vertices from \(G_n\) as \(\{v_1, \ldots , v_n\}\). We denote by V(G) and E(G) the set of vertices and the set of edges of a graph G, respectively. Moreover, we denote by \(N_n(v)\) the neighborhood of the vertex v, that is, all vertices that are adjacent to v in \(G_n\). Sometimes we drop the subscript, if the size of the graph is clear from the context.
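To make the growth process concrete, here is a minimal simulation sketch (for illustration only; the helper name duplicate_graph, the adjacency-dictionary representation, and the example seed graph are illustrative choices, not part of the model definition):

```python
import random

def duplicate_graph(g0_vertices, g0_edges, n, seed=None):
    """Grow G_n from a seed graph G_0 under the full duplication model.

    g0_vertices: labels u_1, ..., u_{n_0} of the seed graph
    g0_edges:    list of pairs (u_i, u_j) forming E(G_0)
    n:           number of duplication steps
    Returns an adjacency dictionary {vertex: set of neighbors}.
    """
    rng = random.Random(seed)
    adj = {u: set() for u in g0_vertices}
    for a, b in g0_edges:
        adj[a].add(b)
        adj[b].add(a)
    for i in range(1, n + 1):
        parent = rng.choice(list(adj))   # pick an existing vertex uniformly at random
        child = ("v", i)
        adj[child] = set(adj[parent])    # copy all of the parent's neighbors ...
        for w in adj[parent]:
            adj[w].add(child)
        # ... but note that the child is NOT connected to the parent itself
    return adj

# Example: a small asymmetric seed graph on 6 vertices
# (a path u1-u2-u3-u4-u5 plus u6 adjacent to u3 and u4)
g0_v = ["u1", "u2", "u3", "u4", "u5", "u6"]
g0_e = [("u1", "u2"), ("u2", "u3"), ("u3", "u4"), ("u4", "u5"), ("u3", "u6"), ("u4", "u6")]
g = duplicate_graph(g0_v, g0_e, n=3, seed=0)
```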

An example of the duplication process is presented in Fig. 1. On the top, we show the original \(G_0\) on 5 vertices, and on the bottom we plot \(G_3\) with new vertices such that \(v_1\) is a copy of \(u_2\), \(v_2\) is a copy of \(u_1\), and \(v_3\) is a copy of \(v_1\).

Fig. 1: Example graph growth in the full duplication model

Here, due to the limited space, we restrict our analysis to asymmetric \(G_0\) (i.e., the underlying automorphism group is of size 1); however, extensions to general \(G_0\) are rather straightforward. We observe that typically even moderate-sized graphs are likely to be asymmetric.

2.2 Basic Properties

Let us introduce the concept of a parent and an ancestor of a vertex. We say that w is the parent of v (denoted by \(w = P(v)\)) when v was copied from w at some time \(1 \le i \le n\). We say that \(w \in U\), where \(U = \{u_1, \ldots , u_{n_0}\}\) denotes the set of vertices of \(G_0\), is the ancestor of v (denoted by \(w = A(v)\)) when there exist vertices \(v_{i_1}, \ldots , v_{i_k}\) such that \(w = P(v_{i_1})\), \(v_{i_j} = P(v_{i_{j + 1}})\) for \(1 \le j \le k - 1\), and \(v_{i_k} = v\). For convenience we write that if \(u \in U\), then \(P(u) = u\) and \(A(u) = u\). Note that the ancestor of any given vertex is unique. In our example from Fig. 1, \(u_2\) is the ancestor of both \(v_1\) and \(v_3\), but the parent only of \(v_1\), not of \(v_3\).

Let us now define the set of descendants of \(u_i \in U\): \({\mathcal {C}}_{i, n} := \{w \in G_n:A(w) = u_i\}\) for \(1 \le i \le n_0\). The neighborhood of a vertex is closely tied to its ancestor, as the following lemma shows:

Lemma 1

Let us fix any \(1 \le i \le n_0\). For all \(n \ge 0\) and any \(v \in {\mathcal {C}}_{i, n}\) we have

$$\begin{aligned} N_n(v) = \bigcup _{u_i u_j \in E(G_0)} {\mathcal {C}}_{j, n}. \end{aligned}$$

Proof

We prove this by induction. For \(n = 0\) we have \({\mathcal {C}}_{i, 0} = \{u_i\}\) and the claim holds.

Now suppose that the claim holds for some \(n \ge 0\) and that \(P(v_{n + 1}) = w\). If \(A(w) = u_k\), then \(A(v_{n + 1}) = u_k\). Moreover,

$$\begin{aligned} {\mathcal {C}}_{k, n + 1}&= {\mathcal {C}}_{k, n} \cup \{v_{n+1}\} \\ {\mathcal {C}}_{i, n + 1}&= {\mathcal {C}}_{i, n} \quad \text {for }i \ne k. \end{aligned}$$

We split the remaining part of the proof into several cases:

Case 1, \(i = k\), \(v = v_{n + 1}\):

by induction hypothesis we have

$$\begin{aligned} N_{n + 1}(v_{n + 1}) = N_{n + 1}(P(v_{n + 1})) = \bigcup _{j: u_k u_j \in E(G_0)} {\mathcal {C}}_{j, n} = \bigcup _{j: u_k u_j \in E(G_0)} {\mathcal {C}}_{j, n + 1}. \end{aligned}$$
Case 2, \(i = k\), \(v \ne v_{n + 1}\):

similarly,

$$\begin{aligned} N_{n + 1}(v) = N_n(v) = \bigcup _{j: u_k u_j \in E(G_0)} {\mathcal {C}}_{j, n} = \bigcup _{j: u_k u_j \in E(G_0)} {\mathcal {C}}_{j, n + 1}. \end{aligned}$$
Case 3, \(i \ne k\), \(u_i u_k \in E(G_0)\):

for any \(v \in {\mathcal {C}}_{i, n + 1} = {\mathcal {C}}_{i, n}\) we have

$$\begin{aligned} N_{n + 1}(v)&= N_n(v) \cup \{v_{n + 1}\} = \bigcup _{j: u_i u_j \in E(G_0)} {\mathcal {C}}_{j, n} \cup \{v_{n + 1}\} \\&= \bigcup _{\begin{array}{c} j: u_i u_j \in E(G_0) \\ j \ne k \end{array}} {\mathcal {C}}_{j, n} \cup {\mathcal {C}}_{k, n} \cup \{v_{n + 1}\} \\&= \bigcup _{\begin{array}{c} j: u_i u_j \in E(G_0) \\ j \ne k \end{array}} {\mathcal {C}}_{j, n + 1} \cup {\mathcal {C}}_{k, n + 1} = \bigcup _{j: u_i u_j \in E(G_0)} {\mathcal {C}}_{j, n + 1}. \end{aligned}$$
Case 4, \(i \ne k\), \(u_i u_k \notin E(G_0)\):

for any \(v \in {\mathcal {C}}_{i, n + 1} = {\mathcal {C}}_{i, n}\) we have

$$\begin{aligned} N_{n + 1}(v) = N_n(v) = \bigcup _{j: u_i u_j \in E(G_0)} {\mathcal {C}}_{j, n} = \bigcup _{j: u_i u_j \in E(G_0)} {\mathcal {C}}_{j, n + 1}. \end{aligned}$$

Therefore, the proof is completed. \(\square\)

This means that, effectively, \(G_n\) is composed of clusters such that every vertex of the i-th cluster is connected to every vertex of the j-th cluster if and only if \(u_i u_j \in E(G_0)\). For example, for the graph in Fig. 1b we may identify (marked with ellipses in the figure) the following classes of vertices with identical neighborhoods: \({\mathcal {C}}_{1, n} = \{u_1, v_2\}\), \({\mathcal {C}}_{2, n} = \{u_2, v_1, v_3\}\), \({\mathcal {C}}_{3, n} = \{u_3\}\), \({\mathcal {C}}_{4, n} = \{u_4\}\) and \({\mathcal {C}}_{5, n} = \{u_5\}\).

Let now \(C_{i, n} = |{\mathcal {C}}_{i, n}|\), that is, the number of vertices from \(G_n\) that are ultimately copies of \(u_i\) (including \(u_i\) itself).

It is not hard to see that the sequence of variables \((C_{i, n})_{i = 1}^{n_0}\) can be described as a ball and urn model with \(n_0\) urns. At time \(n = 0\) each urn contains exactly one ball. Each iteration consists of picking an urn at random, proportionally to the number of balls it contains – that is, with probability \(\frac{C_{i, n}}{\sum _{j = 1}^{n_0} C_{j, n}}\) – and adding a new ball to this urn. It is known [15] that the joint distribution of \((C_{i, n})_{i = 1}^{n_0}\) is directly related to the Dirichlet multinomial distribution denoted as \(Dir(n, \alpha _1, \ldots , \alpha _K)\), with \(K = n_0\) and \(\alpha _i = 1\) for \(1 \le i \le n_0\):

$$\begin{aligned} \Pr \left( (C_{i, n})_{i = 1}^{n_0} = (k_i + 1)_{i = 1}^{n_0}\right) = {\left\{ \begin{array}{ll} n B(n, n_0) &{} \text {if }\sum _{i = 1}^{n_0} k_i = n, \forall _{1 \le i \le n_0}\, k_i \ge 0, \\ 0 &{} \text {otherwise}. \end{array}\right. } \end{aligned}$$

where B(x, y) is the Euler beta function. Note that \(n B(n, n_0) = 1 / \left( {\begin{array}{c}n + n_0 - 1\\ n_0 - 1\end{array}}\right)\), so the vector \((C_{i, n} - 1)_{i = 1}^{n_0}\) is uniformly distributed over all compositions of n into \(n_0\) nonnegative parts.
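As a quick sanity check (an illustrative sketch only; the helper name urn_counts is hypothetical), one can simulate the urn process and compare the empirical frequency of each cluster-size vector with the constant value \(n B(n, n_0)\):

```python
import random
from collections import Counter
from math import comb

def urn_counts(n, n0, rng):
    """One run of the urn process: n0 urns with one ball each, then n additions."""
    c = [1] * n0
    for t in range(n):
        r = rng.randrange(n0 + t)      # total number of balls so far is n0 + t
        i = 0
        while r >= c[i]:               # pick urn i with probability c[i] / (n0 + t)
            r -= c[i]
            i += 1
        c[i] += 1
    return tuple(c)

rng = random.Random(1)
n, n0, trials = 4, 3, 200_000
freq = Counter(urn_counts(n, n0, rng) for _ in range(trials))
uniform = 1 / comb(n + n0 - 1, n0 - 1)   # = n * B(n, n0)
for sizes, cnt in sorted(freq.items()):
    print(sizes, round(cnt / trials, 4), "vs", round(uniform, 4))
```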

The variables \(C_{i, n}\) are identically distributed – though not independent, since \(\sum _{i = 1}^{n_0} C_{i, n} = n + n_0\) – so we may analyze a single variable \(C_n\) distributed as \(C_{i, n}\) for any \(1 \le i \le n_0\). In fact, \(C_n - 1\) has the beta-binomial distribution \(BBin(n, \alpha , \beta )\) with parameters \(\alpha = 1\), \(\beta = n_0 - 1\). That is, for any \(k \ge 0\):

$$\begin{aligned} \Pr (C_n = k + 1)&= \left( {\begin{array}{c}n\\ k\end{array}}\right) \frac{B(k + 1, n + n_0 - k - 1)}{B(1, n_0 - 1)} \end{aligned}$$
(1)
$$\begin{aligned}&= (n_0 - 1) \left( {\begin{array}{c}n\\ k\end{array}}\right) B(k + 1, n + n_0 - k - 1). \end{aligned}$$
(2)

Chung et al. claimed in [6] that the distribution of \(C_n\) can be approximated by a density function \(f(x) = \exp \left( -\frac{x}{{\mathbb {E}}C_n}\right)\). Instead, here we have an exact formula.

Moreover, since \(C_n \sim BBin(n, 1, n_0 - 1) + 1\) we know immediately that \({\mathbb {E}}C_n = \frac{n}{n_0} + 1\). For subsequent results we will also need additional properties of the beta-binomial distribution (with proofs provided in the appendices).

Note that all the logarithms used in subsequent theorems (unless explicitly noted as \(\ln\)) have base 2.

Lemma 2

If \(X \sim BBin(n, \alpha , \beta )\), then \({\mathbb {E}}[\log (X + 1)] = \log n + \left( \psi (\alpha ) - \psi (\alpha + \beta )\right) \log e + o(1)\), where \(\psi (x) = \frac{\Gamma '(x)}{\Gamma (x)}\) is the Euler digamma function.

Since for all integers r, s we have \(\psi (r) - \psi (s) = H_{r - 1} - H_{s - 1}\) (where \(H_j\) denotes the j-th harmonic number), it follows that

Corollary 1

\({\mathbb {E}}[\log C_n] = \log n - H_{n_0 - 1} \log e + o(1)\) for large n.

Similarly, we may prove that:

Lemma 3

If \(X \sim BBin(n, \alpha , \beta )\), then

$$\begin{aligned} {\mathbb {E}}[(X + 1) \log (X + 1)] &= n \log n \frac{\alpha }{\alpha + \beta } + n \frac{\alpha \left( \psi (\alpha + 1) - \psi (\alpha + \beta + 1)\right) \log e}{\alpha + \beta } \\ &\quad + \log n + \left( \psi (\alpha ) - \psi (\alpha + \beta ) + 1 + \frac{\beta }{2 (\alpha + \beta )}\right) \log e + o(1). \end{aligned}$$

From the above lemma it follows directly that:

Corollary 2

Asymptotically

$$\begin{aligned} {\mathbb {E}}[C_n \log C_n]&= \frac{1}{n_0} n \log n + n \frac{(1 - H_{n_0}) \log e}{n_0} + \log n \\&\qquad + \left( \frac{3}{2} - \frac{1}{2 n_0} - H_{n_0 - 1}\right) \log e + o(1). \end{aligned}$$
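Both corollaries are easy to check numerically against the exact probability mass function in Eq. (2). The following sketch (an illustration only; the helper names log_pmf_C and harmonic are hypothetical) computes the exact expectations by direct summation and compares them with the asymptotic formulas:

```python
from math import lgamma, log, log2, exp

def log_pmf_C(c, n, n0):
    """log Pr(C_n = c) from Eq. (2), computed via log-gamma for numerical stability."""
    k = c - 1
    return (log(n0 - 1) + lgamma(n + 1) - lgamma(n - k + 1)
            + lgamma(n + n0 - k - 1) - lgamma(n + n0))

def harmonic(m):
    return sum(1.0 / i for i in range(1, m + 1))

n, n0 = 10_000, 5
LOG2E = log2(exp(1))

e_log   = sum(exp(log_pmf_C(c, n, n0)) * log2(c)     for c in range(1, n + 2))
e_clogc = sum(exp(log_pmf_C(c, n, n0)) * c * log2(c) for c in range(1, n + 2))

# Corollary 1: E[log C_n] ~ log n - H_{n0-1} log e
print(e_log, "vs", log2(n) - harmonic(n0 - 1) * LOG2E)
# Corollary 2: E[C_n log C_n] ~ (n log n)/n0 + n (1 - H_{n0}) log e / n0
#              + log n + (3/2 - 1/(2 n0) - H_{n0-1}) log e
print(e_clogc, "vs", n * log2(n) / n0 + n * (1 - harmonic(n0)) * LOG2E / n0
      + log2(n) + (1.5 - 1.0 / (2 * n0) - harmonic(n0 - 1)) * LOG2E)
```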

3 Main Theoretical Results

As discussed in the introduction, our goal is to derive, for duplication graphs, the structural parameters that are fundamental to statistical and information-theoretic problems involving the information shared between the labels and the structure of a random graph. In graph structure compression the goal is to remove label information so as to produce a compact description of the graph structure.

Formally, the labeled graph compression problem can be phrased as follows: one is given a probability distribution \({\mathcal {G}}_n\) on graphs on n vertices, and the task is to exhibit a pair of mappings (i.e., a source code) (E, D), where E maps graphs to binary strings satisfying the standard prefix code condition, and D maps binary strings back to graphs, such that, for all graphs G, \(D(E(G)) = G\), and the expected code length \({\mathbb {E}}[|E(G)|]\), with \(G \sim {\mathcal {G}}_n\), is minimized. The standard source coding theorem tells us that the fundamental limit for this quantity is H(G), the Shannon entropy, defined as:

$$\begin{aligned} H(G) = -\sum _{G \in {\mathcal {G}}_n} P(G) \log P(G), \end{aligned}$$
(3)

where, with a slight abuse of notation, the sum ranges over all graphs with positive probability; H(G) is a functional of the distribution of G, not of a fixed graph.

The unlabeled version of this problem relaxes the invertibility constraint on the encoder and decoder. In particular, we only require \(D(E(G)) \cong G\); i.e., the decoder only outputs a graph isomorphic to G. Again, the optimization objective is to minimize the expected code length. Thus, in effect, the source code efficiently describes the isomorphism type of its input. Denoting by S(G) the isomorphism type of G, the fundamental limit for the expected code length is the structural entropy of the model, which is given by H(S(G)).

There is a relation between the labeled entropy H(G) and structural entropy H(S(G)). To express it succinctly for a broad class of graph models we need the automorphism group \({\mathrm {Aut}}(G)\), and the set \(\Gamma (G)\) of feasible permutations of G; i.e., the set of permutations of G that yield a graph that has positive probability under the random graph model in question. See [5, 17] for more details.

Now, we are ready to present a relation between H(G) and H(S(G)). The following lemma was proved in [17]:

Lemma 4

We have, for any graph model \(G_n\) in which all positive-probability labeled graphs that are isomorphic have the same probability,

$$\begin{aligned} H(G_n) - H(S(G_n)) = {\mathbb {E}}[\log |\Gamma (G_n)|] - {\mathbb {E}}[\log |{\mathrm {Aut}}(G_n)|]. \end{aligned}$$

Now we prove the following results regarding the expected logarithms of the sizes of the automorphism group and feasible permutation set for samples \(G_n\) from the full duplication model.

Lemma 5

We have

$$\begin{aligned} {\mathbb {E}}[\log |{\mathrm {Aut}}(G_n)|]&= n \log n - n H_{n_0} \log e + \frac{3 n_0}{2} \log n \\ &\quad + \left( \frac{n_0}{2} - \frac{1}{2} - \frac{3 n_0}{2} H_{n_0 - 1} \right) \log e + \frac{n_0}{2} \log (2 \pi ) + o(1) \end{aligned}$$

for large n.

Proof

Under the assumption that \(|{\mathrm {Aut}}(G_0)| = 1\) we have \({\mathbb {E}}[\log |{\mathrm {Aut}}(G_n)|] = {\mathbb {E}}\left[ \log \prod \nolimits _{i = 1}^{n_0} C_{i, n}!\right]\). To prove it, it is sufficient to notice that all vertices v, w such that \(A(v) = A(w)\) can be mapped onto one another arbitrarily (since by Lemma 1 they have equal neighborhoods) – but if \(A(v) \ne A(w)\), there does not exist any automorphism \(\sigma\) for which v and w are in the same orbit. Precisely, if such a \(\sigma\) existed, then one may show that it would induce a nontrivial automorphism of \(G_0\), contradicting the asymmetry of \(G_0\).

Thus,

$$\begin{aligned} {\mathbb {E}}&[\log |{\mathrm {Aut}}(G_n)|] = {\mathbb {E}}\left[ \log \prod _{i = 1}^{n_0} C_{i, n}!\right] = \sum _{i = 1}^{n_0} {\mathbb {E}}[\log C_{i, n}!] = n_0 {\mathbb {E}}[\log C_n!]. \end{aligned}$$

We use Stirling’s approximation together with Corollaries 1 and 2 to obtain

$$\begin{aligned} {\mathbb {E}}&[\log C_n!] = {\mathbb {E}}[C_n \log C_n] - {\mathbb {E}}C_n \log e + \frac{1}{2} {\mathbb {E}}[\log C_n] + \frac{1}{2} \log (2 \pi ) + o(1) \\&= {\mathbb {E}}[C_n \log C_n] - n \frac{\log e}{n_0} - \log e + \frac{1}{2} {\mathbb {E}}[\log C_n] + \frac{1}{2} \log (2 \pi ) + o(1) \\&= n \log n \frac{1}{n_0} - n \frac{H_{n_0} \log e}{n_0} + \frac{3}{2} \log n \\&\quad + \left( \frac{1}{2} - \frac{1}{2 n_0} - \frac{3}{2} H_{n_0 - 1}\right) \log e + \frac{1}{2} \log (2 \pi ) + o(1). \end{aligned}$$

Finally,

$$\begin{aligned} {\mathbb {E}}[\log |{\mathrm {Aut}}(G_n)|]&= n \log n - n H_{n_0} \log e + \frac{3 n_0}{2} \log n \\&\quad + \left( \frac{n_0}{2} - \frac{1}{2} - \frac{3 n_0}{2} H_{n_0 - 1} \right) \log e + \frac{n_0}{2} \log (2 \pi ) + o(1). \end{aligned}$$

The proof is completed. \(\square\)
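As a numerical illustration (not part of the argument; the helper names below are hypothetical), the exact value \({\mathbb {E}}[\log |{\mathrm {Aut}}(G_n)|] = n_0 \, {\mathbb {E}}[\log C_n!]\) can be evaluated by direct summation over the beta-binomial distribution from Eq. (2) and compared with the asymptotic formula of Lemma 5:

```python
from math import lgamma, log, log2, exp, pi

def log_pmf_C(c, n, n0):
    """log Pr(C_n = c) from Eq. (2), via log-gamma."""
    k = c - 1
    return (log(n0 - 1) + lgamma(n + 1) - lgamma(n - k + 1)
            + lgamma(n + n0 - k - 1) - lgamma(n + n0))

def harmonic(m):
    return sum(1.0 / i for i in range(1, m + 1))

n, n0 = 5_000, 4
LOG2E = log2(exp(1))

# exact: E[log |Aut(G_n)|] = n0 * E[log C_n!], with log2(c!) = lgamma(c + 1) / ln 2
exact = n0 * sum(exp(log_pmf_C(c, n, n0)) * lgamma(c + 1) / log(2) for c in range(1, n + 2))

# asymptotic value from Lemma 5
asym = (n * log2(n) - n * harmonic(n0) * LOG2E + 1.5 * n0 * log2(n)
        + (n0 / 2 - 0.5 - 1.5 * n0 * harmonic(n0 - 1)) * LOG2E
        + (n0 / 2) * log2(2 * pi))
print(exact, asym)
```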

Observe that \(G_n\) has \(n + n_0\) vertices; therefore, the trivial upper bound on \(|\Gamma (G_n)|\) is \((n + n_0)!\). We can compute \(|\Gamma (G_n)|\) exactly using the following lemma:

Lemma 6

For a permutation \(\pi\)of all vertices in \(G_n\), the following two claims are equivalent:

1. \(\pi\) is a relabeling of \(G_n\) which produces a positive-probability graph under the full duplication model,

2. \(\pi\) is a permutation such that for every \(1 \le i \le n_0\) there exists \(v \in {\mathcal {C}}_{i, n}\) such that \(\pi (v) = u_i\).

Proof

In the whole proof we denote by \(u'_1, \ldots , u'_{n_0}\) the vertices that are mapped by \(\pi\) to the starting graph vertices \(u_1, \ldots , u_{n_0}\). That is, \(u'_i = \pi ^{-1}(u_i)\) for each \(i \in \{1, 2, \ldots , n_0\}\).

(\(\Rightarrow\)) Let \(\pi\) produce a graph under the considered model with positive probability.

Suppose now that there exists \(1 \le k \le n_0\) such that \(u'_k \notin {\mathcal {C}}_{k, n}\), but \(u'_k \in {\mathcal {C}}_{l, n}\) for some \(l \ne k\). Then, by Lemma 1 we know that \(N_n(u_k) = \bigcup _{u_k u_j \in E(G_0)} {\mathcal {C}}_{j, n}\) and \(N_n(u'_k) = N_n(u_l) = \bigcup _{u_l u_j \in E(G_0)} {\mathcal {C}}_{j, n}\).

Since \(|{\mathrm {Aut}}(G_0)| = 1\) by assumption, \(N_0(u_k) \ne N_0(u_l)\) and therefore

$$\begin{aligned} N_n(u'_k) \setminus N_n(u_k)&= \bigcup _{u_l u_j \in E(G_0)} {\mathcal {C}}_{j, n} \setminus \bigcup _{u_k u_j \in E(G_0)} {\mathcal {C}}_{j, n} \\&\supseteq \bigcup _{u_l u_j \in E(G_0)} {\mathcal {C}}_{j, 0} \setminus \bigcup _{u_k u_j \in E(G_0)} {\mathcal {C}}_{j, 0} \\&= N_0(u_l) \setminus N_0(u_k) \ne \emptyset \end{aligned}$$

which proves that \(N_n(u'_k) \ne N_n(u_k)\), and therefore \(G'_0\), the graph induced by \(u'_1, \ldots , u'_{n_0}\), cannot be identical to \(G_0\).

(\(\Leftarrow\)) Denote by \(v_1'\), ..., \(v_n'\) the vertices \(\pi ^{-1}(v_1)\), ..., \(\pi ^{-1}(v_n)\); i.e., these vertices are mapped by \(\pi\) to vertices outside the seed graph.

By assumption, for every \(v'_i\), \(1 \le i \le n\), there exists some \(u'_j = \pi ^{-1}(u_j)\), \(1 \le j \le n_0\), such that \(v'_i, u'_j \in {\mathcal {C}}_{j, n}\). Now, in the i-th step we may just copy \(v'_i\) from its respective \(u'_j\). It is easy to check that, in the graph created in this way, for every \(1 \le k \le n_0\) and every \(v'_i \in {\mathcal {C}}_{k, n}\) the neighborhoods \(N'_n(v'_i)\) satisfy

$$\begin{aligned} N'_n(v'_i) = N'_n(u'_k) = \bigcup _{j: u_k u_j \in E(G_0)} {\mathcal {C}}_{j, n} = N_n(u_k) = N_n(v'_i), \end{aligned}$$

which concludes the proof. \(\square\)

Lemma 7

Asymptotically

$$\begin{aligned} {\mathbb {E}}[\log |\Gamma (G_n)|]&= n \log n - n \log e + \left( n_0 + \frac{1}{2}\right) \log n \\&\quad - n_0 H_{n_0 - 1} \log e + \frac{1}{2} \log (2 \pi ) + o(1). \end{aligned}$$

Proof

From Lemma 6, we may construct all admissible permutations by choosing for each \({\mathcal {C}}_{i, n}\) exactly one vertex which would be mapped to \(u_i\) and then arranging the remaining n vertices in any order. Therefore:

$$\begin{aligned} |\Gamma (G_n)| = n! \prod _{i = 1}^{n_0} \left( {\begin{array}{c}C_{i, n}\\ 1\end{array}}\right) = n! \prod _{i = 1}^{n_0} C_{i, n}. \end{aligned}$$

Then

$$\begin{aligned} {\mathbb {E}}[\log |\Gamma (G_n)|]&= \log n! + \sum _{i = 1}^{n_0} {\mathbb {E}}[\log C_{i, n}] = \log n! + n_0 {\mathbb {E}}[\log C_n] \\&= \log n! + n_0 \log n - n_0 H_{n_0 - 1} \log e + o(1), \end{aligned}$$

and the final result follows from the Stirling approximation. \(\square\)

We now proceed to estimate the structural entropy.

Theorem 1

For large n we have

$$\begin{aligned} H(S(G_n)~|~G_0) = (n_0 - 1) \log n - \log (n_0 - 1)! + o(1). \end{aligned}$$

Proof

Recalling that we assume throughout that the initial graph \(G_0\) is asymmetric, it may be seen that the isomorphism type of \(G_n\) is entirely specified by the vector \((C_{i,n})^{n_0}_{i=1}\). We know that \((C_{i, n})_{i = 1}^{n_0}\) has the Dirichlet multinomial distribution with \(\alpha _i = 1\) for \(1 \le i \le n_0\).

Therefore

$$\begin{aligned} H&(S(G_n)~|~G_0) = H\left( (C_{i, n})_{i = 1}^{n_0}\right) \\&= - \sum _{(k_i)} \Pr ((C_{i, n})_{i = 1}^{n_0} = (k_i + 1)_{i=1}^{n_0}) \log \Pr ((C_{i, n})_{i = 1}^{n_0} = (k_i + 1)_{i=1}^{n_0}) \\&= - \log (n B(n, n_0)) \sum _{(k_i)} \Pr ((C_{i, n})_{i = 1}^{n_0} = (k_i + 1)_{i=1}^{n_0}) \\&= - \log n - \log B(n, n_0) = (n_0 - 1) \log n - \log (n_0 - 1)! + o(1). \end{aligned}$$

The last equality follows from the asymptotic expansion of \(\log B(n, n_0)\) via the Stirling approximation, which completes the proof. \(\square\)
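The structural entropy can also be evaluated exactly: since \(H(S(G_n)~|~G_0) = -\log (n B(n, n_0)) = \log \left( {\begin{array}{c}n + n_0 - 1\\ n_0 - 1\end{array}}\right)\), a few lines of code (a numerical illustration only) compare the exact value with the asymptotic formula of Theorem 1:

```python
from math import comb, log2, lgamma, log

def log2_factorial(m):
    return lgamma(m + 1) / log(2)

n0 = 5
for n in (10**3, 10**5, 10**7):
    exact = log2(comb(n + n0 - 1, n0 - 1))                # = -log(n B(n, n0))
    asym  = (n0 - 1) * log2(n) - log2_factorial(n0 - 1)   # Theorem 1
    print(n, round(exact, 4), round(asym, 4))
```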

To compute the graph entropy \(H(G_n~|~G_0)\) we can use Lemmas 4, 5 and 7 together with Theorem 1, thereby obtaining the following result.

Theorem 2

For large n

$$\begin{aligned} H(G_n~|~G_0)&= n (H_{n_0} - 1) \log e + \log n \,\frac{n_0 - 1}{2} - \log (n_0 - 1)! \\&\quad + \left( \frac{1 - n_0}{2} + \frac{n_0}{2} H_{n_0 - 1}\right) \log e + \frac{1 - n_0}{2} \log (2 \pi ) + o(1). \end{aligned}$$

Clearly, to compress the whole \(G_n\) we would have to encode \(G_0\) as well, but since \(n_0\) is fixed, this only affects the constant term. Moreover, by the chain rule for conditional entropy, combining any optimal compression of \(G_0\) with the algorithms below yields an asymptotically optimal compression of \(G_n\).

4 Algorithmic Results

In this section we present asymptotically optimal algorithms for the compression of labeled and unlabeled graphs generated according to the full duplication model.

4.1 Retrieval of Parameters from \(G_n\)

In order to present efficient compression algorithms for the duplication model, we must first reconstruct \(G_0\) from \(G_n\) and find the values of \(n_0\) and n. This is relatively easy to accomplish, as the proof of the next theorem shows.

Theorem 3

For a given labeled \(G_n\) or its unlabeled version \(S(G_n)\), we can retrieve n, \(n_0\) and \(G_0\) (in the unlabeled case, up to isomorphism of \(G_0\)) in time polynomial in n.

Proof

For a labeled \(G_n\) let \((w_1, w_2, \ldots , w_{n + n_0})\) be its vertices in the order of appearance. Since \((w_1, \ldots , w_{n_0}) = (u_1, \ldots , u_{n_0})\) and \((w_{n_0+1}, \ldots , w_{n_0 + n}) = (v_1, \ldots , v_{n})\), it is sufficient to find the smallest k such that \(N_n(w_k) = N_n(w_i)\) for some \(1 \le i < k\). Then \(n_0 = k - 1\) and \(G_0\) is induced by the sequence \((w_1, \ldots , w_{k - 1})\).

The case of \(S(G_n)\) is similar: we know (for details see Lemma 6) that the sequence of the first \(n_0\) vertices of the graph (that is, \(G_0\)) contains exactly one vertex from each set \({\mathcal {C}}_{i, n}\).

From Lemma 1 it follows that \(A(v) = A(w)\) iff \(N_n(v) = N_n(w)\) for every \(v, w \in V(G_n)\), so it is sufficient to scan all vertices of \(G_n\) and split them into sets such that v and w belong to the same set iff \(N_n(v) = N_n(w)\). Then, we pick one vertex from each set to form \(G_0\). Obviously, \(n_0\) and n may be extracted from the sizes of \(G_0\) and \(G_n\). \(\square\)

Recall for example that in Fig. 1b we identified the clusters \(\{u_1, v_2\}\), \(\{u_2, v_1, v_3\}\), \(\{u_3\}\), \(\{u_4\}\) and \(\{u_5\}\). Therefore, we know that \(n_0 = 5\), \(n = 3\) and \(G_0\) is isomorphic to the graph induced, for example, by the set \(\{v_2, v_3, u_3, u_4, u_5\}\).
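The procedure from the proof translates directly into a short routine. The following is a minimal sketch (the function name recover_parameters is hypothetical; the adjacency-dictionary input matches the duplicate_graph sketch in Sect. 2):

```python
from collections import defaultdict

def recover_parameters(adj):
    """Recover n0, n and a copy of G_0 from G_n (in the unlabeled case, up to isomorphism).

    By Lemma 1, two vertices share an ancestor iff they have identical neighborhoods,
    so we group vertices by neighborhood and keep one representative per group.
    """
    groups = defaultdict(list)
    for v, nbrs in adj.items():
        groups[frozenset(nbrs)].append(v)
    reps = [vs[0] for vs in groups.values()]           # one vertex per descendant set C_{i,n}
    n0 = len(reps)
    n = len(adj) - n0
    idx = {v: i for i, v in enumerate(reps)}
    g0_edges = [(a, b) for a in reps for b in adj[a]   # subgraph induced by the representatives
                if b in idx and idx[a] < idx[b]]
    return n0, n, reps, g0_edges
```

For instance, applying this routine to a graph produced by the duplicate_graph sketch recovers the seed size and a graph isomorphic to the seed.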

4.2 Unlabeled Graphs

A trivial algorithm CompressUnlabeledSimple for unlabeled compression writes down a sequence \((C_{i, n})_{i = 1}^{n_0}\) associated with our \(G_n\) as \(\log n\)-bit numbers. This always requires \(n_0 \log n\) bits, so \({\mathbb {E}}L_{SU}(n) = n_0 \log n\), where \(L_{SU}\) denotes the code length of our proposed scheme. By Theorem 1 this achieves the fundamental limit to within a multiplicative factor of \(1 + \frac{1}{n_0 - 1}\).
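For concreteness, a fixed-width encoder along these lines might look as follows (a minimal sketch; the function name is illustrative, and it writes each \(C_{i, n} - 1\) in \(\lceil \log (n + 1) \rceil\)-bit fields, matching the \(n_0 \log n\) bound up to rounding):

```python
from math import ceil, log2

def compress_unlabeled_simple(cluster_sizes, n):
    """Write each C_{i,n} - 1 as a fixed-width binary field of ceil(log2(n+1)) bits."""
    width = max(1, ceil(log2(n + 1)))
    return "".join(format(c - 1, "0{}b".format(width)) for c in cluster_sizes)
```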

However, it is easy to design an algorithm that is optimal up to a constant additive error, provided we have already compressed \(G_0\) or \(S(G_0)\) (in any case, a graph of fixed size). The pseudocode of this optimal algorithm, called CompressUnlabeledOpt, based on arithmetic coding, is as follows:

Algorithm CompressUnlabeledOpt (pseudocode figure)
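As a complement to the pseudocode, here is a simplified sketch of one possible implementation (the function names, the exact-rational interval representation, and the use of the equivalent uniform-over-compositions form of the conditional beta-binomial distributions from the proof of Theorem 4 are illustrative choices). The decoder is assumed to know n, \(n_0\) and \(G_0\), which are encoded separately as discussed above.

```python
from fractions import Fraction
from math import comb

def compositions_count(total, parts):
    """Number of ways to write `total` as an ordered sum of `parts` nonnegative integers."""
    return comb(total + parts - 1, parts - 1)

def compress_unlabeled_opt(cluster_sizes):
    """Arithmetic coding of (C_{i,n})_{i=1}^{n0}, coordinate by coordinate.

    Each conditional distribution is the beta-binomial marginal from the proof of
    Theorem 4, written here in its equivalent uniform-over-compositions form.
    Simplified sketch: exact rational interval narrowing, then the interval
    midpoint truncated to ceil(-log2(width)) + 1 bits.
    """
    ks = [c - 1 for c in cluster_sizes]          # k_i = C_{i,n} - 1, summing to n
    lo, width = Fraction(0), Fraction(1)
    m, r = sum(ks), len(ks)
    for k in ks[:-1]:                            # the last coordinate is determined by the rest
        total = compositions_count(m, r)
        below = sum(compositions_count(m - j, r - 1) for j in range(k))  # mass of values < k
        here = compositions_count(m - k, r - 1)                          # mass of the value k
        lo += width * Fraction(below, total)
        width *= Fraction(here, total)
        m, r = m - k, r - 1
    bits, w = 1, width                           # bits = ceil(-log2(width)) + 1, computed exactly
    while w < 1:
        w *= 2
        bits += 1
    code = int((lo + width / 2) * 2**bits)       # truncate the midpoint of the final interval
    return format(code, "0{}b".format(bits))
```

A decoder would reverse the interval narrowing using the same sequence of conditional distributions.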

The next theorem shows that CompressUnlabeledOpt is nearly optimal.

Theorem 4

Algorithm CompressUnlabeledOpt is optimal up to two bits for unlabeled graph compression, when the graph is generated by the full duplication model.

Proof

It is sufficient to observe that

$$\begin{aligned}&\Pr ((C_{i, n})_{i = 1}^{n_0} = (k_i + 1)_{i = 1}^{n_0}) \\&= \Pr ((C_{i, n})_{i = 1}^{n_0} = (k_i + 1)_{i = 1}^{n_0}~|~C_{n_0, n} = k_{n_0} + 1) \Pr (C_{n_0, n} = k_{n_0} + 1) \\&= \Pr ((C_{i, n})_{i = 1}^{n_0 - 1} = (k_i + 1)_{i = 1}^{n_0 - 1}~|~C_{n_0, n} = k_{n_0} + 1) \Pr (C_{n_0, n} = k_{n_0} + 1) \\&= \Pr \left( (C_{i, n})_{i = 1}^{n_0 - 1} = (k_i + 1)_{i = 1}^{n_0 - 1} \Big \vert \sum _{i = 1}^{n_0 - 1} C_{i, n} = n + n_0 - k_{n_0} - 1\right) \\&\qquad \Pr (C_{n_0, n} = k_{n_0} + 1) \\&= \Pr ((C_{i, n - k_{n_0}})_{i = 1}^{n_0 - 1} = (k_i + 1)_{i = 1}^{n_0 - 1}) \\&\qquad \left( {\begin{array}{c}n\\ k_{n_0}\end{array}}\right) (n_0 - 1) B(k_{n_0} + 1, n + n_0 - k_{n_0} - 1). \end{aligned}$$

The last equality follows from the fact that the marginal distribution of the Dirichlet multinomial distribution is the beta-binomial distribution, given by Eq. 1. Moreover, if we fix the value of the last coordinate of \((C_{i, n})_{i = 1}^{n_0}\) to \(k + 1\), then the resulting distribution is also a (shifted) Dirichlet multinomial distribution, but with \(n_0 - 1\) coordinates and all values summing up to \(n + n_0 - k - 1\).

We repeat this process until we are left with a 2-dimensional distribution:

$$\begin{aligned} \Pr&((C_{i, n})_{i = 1}^{2} = (k_i + 1)_{i = 1}^{2}) \\&= \Pr (C_{1, n} = k_1 + 1~|~C_{2, n} = k_2 + 1) \Pr (C_{2, n} = k_2 + 1) \\&= \left( {\begin{array}{c}k_1 + k_2\\ k_2\end{array}}\right) B(k_2 + 1, k_1 + 1). \end{aligned}$$

By the properties of arithmetic coding (see e.g. [7]), \({\mathbb {E}}L_O(S(G_n)~|~G_0) \le H((C_{i, n})_{i = 1}^{n_0}) + 2 = H(S(G_n)~|~G_0) + 2\), where \(L_O\) denotes the code length. This completes the proof. \(\square\)

4.3 Labeled Graphs

We note that, for a given (labeled) \(G_0\), the labeled graph \(G_n\) is equivalent to the sequence \((A(v_i))_{i = 1}^n\); the graph \(G_0\) itself can be encoded separately using a constant number of bits.

A trivial algorithm CompressLabeledSimple just writes all \(A(v_i)\) as \(\log n_0\)-bit numbers. Clearly, this always gives a codeword of length \({\mathbb {E}}L_{SL}(n) = n \log n_0\). From Theorem 2 it follows that this algorithm is asymptotically \((1 + \frac{1 - \gamma }{\log n_0})\)-approximately optimal, where \(\gamma\) is the Euler-Mascheroni constant.

It is easy to design an asymptotically optimal algorithm up to a constant error. Indeed, the sequence of \(A(v_i)\) is random with \(\Pr (A(v_i) = u_j) = \frac{C_{j, i - 1}}{n_0 + i - 1}\) for \(1 \le i \le n\), \(1 \le j \le n_0\). Therefore, given \(G_{i - 1}\) we know the conditional probabilities of \(G_i\) and we may construct another algorithm based on arithmetic coding.

The pseudocode of the optimal algorithm is as follows:

Algorithm CompressLabeledOpt (pseudocode figure)
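As with the unlabeled case, here is a simplified sketch of one possible implementation of the adaptive scheme described above (the function names and representation are illustrative choices); it encodes the ancestor sequence with the probabilities \(\Pr (A(v_i) = u_j) = \frac{C_{j, i - 1}}{n_0 + i - 1}\):

```python
from fractions import Fraction

def compress_labeled_opt(ancestors, n0):
    """Arithmetic coding of the ancestor sequence (A(v_i))_{i=1}^n.

    ancestors: list of seed-vertex indices in {0, ..., n0 - 1}, one per duplication step.
    The probabilities are updated adaptively: Pr(A(v_i) = u_j) = C_{j,i-1} / (n0 + i - 1).
    """
    counts = [1] * n0                        # C_{j,0} = 1 for every seed vertex
    lo, width = Fraction(0), Fraction(1)
    for i, j in enumerate(ancestors, start=1):
        total = n0 + i - 1                   # number of vertices of G_{i-1}
        lo += width * Fraction(sum(counts[:j]), total)
        width *= Fraction(counts[j], total)
        counts[j] += 1                       # the new vertex joins the j-th cluster
    bits, w = 1, width                       # bits = ceil(-log2(width)) + 1, computed exactly
    while w < 1:
        w *= 2
        bits += 1
    code = int((lo + width / 2) * 2**bits)   # truncate the midpoint of the final interval
    return format(code, "0{}b".format(bits))
```

The ancestor indices themselves can be read off a labeled \(G_n\) by grouping vertices by neighborhood, as in the proof of Theorem 3.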

The next theorem shows that CompressLabeledOpt is optimal up to a small additive constant.

Theorem 5

Algorithm CompressLabeledOpt is optimal up to two bits for labeled graph compression, when the graph is generated by the full duplication model.

Proof

By the well-known properties of arithmetic encoding (see [7]), we know that \({\mathbb {E}}L_O(G_n~|~G_0) \le H(G_n~|~G_0) + 2\), where \(L_O\) denotes the code length. \(\square\)

Note that these two algorithms for labeled graphs differ only in that the optimal one updates the probabilities at each step, while the simple one fixes them at the constant value \(\frac{1}{n_0}\).