Appendix
MLE method
We present the MLE of \({\varvec{\omega }}\) only for the graph sampling method UNI, because the MLE is not easy to derive for other graph sampling methods such as RW, MHRW, and FS. Suppose that the graph size \(|V|\) is known (it can be estimated by the sampling methods proposed in [22]) and that \(n < |V|\) vertices are sampled, so that each copy of \(\mathbf {c}\) is sampled with the same probability \(p=\frac{n}{|V|}\). For simplicity, we assume that content is distributed over the network uniformly at random. Let M be the maximum number of copies that any content has. Denote by \(P_{i,j}\) the probability that i copies are sampled for content that has j copies in total, given that at least one of its copies is sampled, where \( 1 \le i \le j \le M\). Letting \(q=1-p\), we have \(P_{i,j}=\frac{\left( {\begin{array}{c}j\\ i\end{array}}\right) p^i q^{j-i}}{1 - q^j}\). We compute the MLE of \({\varvec{\omega }}\) from the sampled content copies for the following two cases:
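As an illustration, the probabilities \(P_{i,j}\) can be tabulated directly from p and M. The following Python sketch is not part of the original derivation; the function names `p_ij` and `sampling_matrix` and the 0-based nested-list layout are assumptions made for this example.

```python
from math import comb


def p_ij(i, j, p):
    """P_{i,j}: probability that i of j copies are sampled,
    given that at least one copy is sampled (1 <= i <= j, 0 < p < 1)."""
    if not 1 <= i <= j:
        raise ValueError("require 1 <= i <= j")
    if not 0.0 < p < 1.0:
        raise ValueError("require 0 < p < 1")
    q = 1.0 - p
    return comb(j, i) * p**i * q**(j - i) / (1.0 - q**j)


def sampling_matrix(M, p):
    """Return P as a nested list with P[i-1][j-1] = P_{i,j}.

    Entries with i > j are set to 0.0, so P is upper triangular.
    """
    return [
        [p_ij(i, j, p) if i <= j else 0.0 for j in range(1, M + 1)]
        for i in range(1, M + 1)
    ]
```

Because \(P_{i,j}\) is conditioned on at least one copy being sampled, each column of the matrix (fixed j, summed over i) adds up to one, which is a convenient sanity check.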
Case 1: the content label under study is the number of copies associated with content. For randomly sampled content, let \(\alpha _i\) (\(1\le i\le M\)) be the probability that it has i copies sampled. Among sampled content, let \(x_i\) be the fraction of content with i copies sampled. We have \(\mathbb {E}(x_i) = \alpha _i\), so \(x_i\) is an unbiased estimate of \(\alpha _i\). Next, we present a method to estimate \({\varvec{\omega }}\) based on the relationship between \(\alpha _i\) and \({\varvec{\omega }}\). The probability \(\alpha _i\) can be written in terms of \({\varvec{\omega }}\) as
$$\begin{aligned} \alpha _{i}=\sum _{j=i}^{M} \omega _j P_{i,j}. \end{aligned}$$
(5)
This is similar to packet sampling-based flow size distribution estimation studied in [12], where each packet is sampled with probability p. Here a flow refers to a group of packets with the same source and destination, and the flow size is the number of packets that it contains. In our context, content corresponds to a flow, and its copies correspond to packets in the flow. Therefore, we can develop a maximum likelihood estimate \(\hat{\omega }_k^{\text {MLE}}\) of \(\omega _k\) (\(1\le k\le M\)) similar to the method proposed in [12].
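The estimation step itself is not spelled out here. One standard way to maximize the likelihood of a mixture with known components, which is the form of (5), is the expectation-maximization (EM) iteration. The sketch below assumes that approach and is not necessarily the exact procedure of [12]. The names `em_omega`, `x`, and `iters`, the uniform starting point, and the fixed iteration count are all choices made for illustration.

```python
from math import comb


def sampling_matrix(M, p):
    """P[i-1][j-1] = P_{i,j}; entries with i > j are 0.0."""
    q = 1.0 - p
    return [
        [comb(j, i) * p**i * q**(j - i) / (1.0 - q**j) if i <= j else 0.0
         for j in range(1, M + 1)]
        for i in range(1, M + 1)
    ]


def em_omega(x, P, iters=1000):
    """EM iteration for omega in alpha_i = sum_{j>=i} omega_j * P_{i,j}.

    x[i-1] is the observed fraction of sampled content with i copies sampled.
    Returns a list with omega[j-1] approximating omega_j.
    """
    M = len(x)
    omega = [1.0 / M] * M  # assumed starting point: uniform over 1..M
    for _ in range(iters):
        updated = [0.0] * M
        for i in range(M):
            if x[i] == 0.0:
                continue
            # Model probability of observing i+1 sampled copies.
            alpha_i = sum(omega[j] * P[i][j] for j in range(i, M))
            if alpha_i == 0.0:
                continue
            # E-step and M-step combined: redistribute the observed mass x[i]
            # over the possible true copy counts j >= i.
            for j in range(i, M):
                updated[j] += x[i] * omega[j] * P[i][j] / alpha_i
        omega = updated
    return omega
```

When `x` is generated exactly from (5) for some \({\varvec{\omega }}\), the iteration should recover that \({\varvec{\omega }}\), since the matrix of \(P_{i,j}\) is upper triangular with a positive diagonal. With noisy `x` from a finite sample, the result is only an approximation of the maximum likelihood estimate after the chosen number of iterations.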
Case 2: the content label under study is independent of the number of duplicates and is available in each content copy, that is, it is not a latent property such as the number of copies that content has. In this case, we use the following approach to derive the MLE. Define \(\beta _{k,j}\) (\(0\le k\le K\), \(1\le j\le M\)) as the fraction of content with label \(l_k\) that has j copies, that is, the number of content with label \(l_k\) and j copies divided by the number of content with label \(l_k\). For sampled content, let \(\alpha _{k,i}\) (\(1\le i\le M\)) be the probability that its label is \(l_k\) and it has i copies sampled. Then \(\alpha _{k,i}\) can be written as
$$\begin{aligned} \alpha _{k,i}=\sum _{j=i}^M \beta _{k,j} P_{i,j}. \end{aligned}$$
\(\alpha _{k,i}\) can be estimated from the sampled content copies. That is, among sampled content, let \(x_{k,i}\) be the fraction of content with label \(l_k\) that has i copies sampled. We have \(\mathbb {E}(x_{k,i}) = \alpha _{k,i}\), so \(x_{k,i}\) is an unbiased estimate of \(\alpha _{k,i}\). As with (5), we can then develop a maximum likelihood estimate \(\hat{\beta }_{k,j}\) of \(\beta _{k,j}\), \(1\le j\le M\). Since
$$\begin{aligned} \alpha _k=\omega _k \sum _{i=1}^M \sum _{j=i}^M \beta _{k,j} P_{i,j}, \end{aligned}$$
we have the following estimator of \(\omega _k\)
$$\begin{aligned} \hat{\omega }_k^{\text {MLE}}=\frac{\hat{\alpha }_k}{S^{\text {MLE}}\sum _{i=1}^M \sum _{j=i}^M \hat{\beta }_{k,j} P_{i,j}}, \quad 0\le k\le K, \end{aligned}$$
where \(\hat{\alpha }_k\) is the fraction of sampled content with label \(l_k\), and
$$\begin{aligned} S^{\text {MLE}}=\sum _{k=0}^K \frac{\hat{\alpha }_k}{\sum _{i=1}^M \sum _{j=i}^M \hat{\beta }_{k,j} P_{i,j}}. \end{aligned}$$
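The final normalization step can be written out in a few lines. The following Python sketch simply evaluates the two formulas above for given inputs; the function name `omega_mle` and the nested-list layout (`alpha_hat[k]`, `beta_hat[k][j-1]`, `P[i-1][j-1]`) are assumptions made for this example, and the estimates \(\hat{\alpha }_k\) and \(\hat{\beta }_{k,j}\) are taken as already computed.

```python
def omega_mle(alpha_hat, beta_hat, P):
    """Evaluate the estimator of omega_k for k = 0..K.

    alpha_hat[k]    : fraction of sampled content with label l_k
    beta_hat[k][j-1]: estimate of beta_{k,j}
    P[i-1][j-1]     : P_{i,j}, with zeros where i > j
    """
    M = len(P)
    ratios = []
    for a_k, beta_k in zip(alpha_hat, beta_hat):
        # Double sum over 1 <= i <= j <= M of beta_{k,j} * P_{i,j}.
        denom = sum(beta_k[j] * P[i][j] for i in range(M) for j in range(i, M))
        ratios.append(a_k / denom)
    s_mle = sum(ratios)  # the normalizing constant S^MLE
    return [r / s_mle for r in ratios]
```

By construction the returned values sum to one, which is the role of \(S^{\text {MLE}}\) in the estimator.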