Abstract
Consider the twin problems of estimating the connection probability matrix of an inhomogeneous random graph and the graphon of a W-random graph. We establish the minimax estimation rates with respect to the cut metric for classes of block constant matrices and step function graphons. Surprisingly, our results imply that, from the minimax point of view, the raw data, that is, the adjacency matrix of the observed graph, is already optimal and more involved procedures cannot improve the convergence rates for this metric. This phenomenon contrasts with optimal rates of convergence with respect to other classical distances for graphons such as the \(l_1\) or \(l_2\) metrics.
References
Airoldi, E.M., Costa, T.B., Chan, S.H.: Stochastic blockmodel approximation of a graphon: theory and consistent estimation. In: Advances in Neural Information Processing Systems, pp. 692–700 (2013)
Alon, N., De La Vega, W.F., Kannan, R., Karpinski, M.: Random sampling and approximation of MAX-CSPs. J. Comput. Syst. Sci. 67(2), 212–243 (2003)
Bandeira, A.S., van Handel, R.: Sharp nonasymptotic bounds on the norm of random matrices with independent entries. Ann. Probab. 44(4), 2479–2506 (2016)
Bickel, P.J., Chen, A.: A nonparametric view of network models and Newman–Girvan and other modularities. Proc. Natl. Acad. Sci. 106(50), 21068–21073 (2009)
Bickel, P.J., Chen, A., Levina, E.: The method of moments and degree distributions for network models. Ann. Stat. 39(5), 2280–2301 (2011)
Bollobás, B., Janson, S., Riordan, O.: The phase transition in inhomogeneous random graphs. Random Struct. Algorithms 31(1), 3–122 (2007)
Borgs, C., Chayes, J.T., Cohn, H., Ganguly, S.: Consistent nonparametric estimation for heavy-tailed sparse graphs. ArXiv e-prints (2015)
Borgs, C., Chayes, J.T., Lovász, L., Sós, V.T., Vesztergombi, K.: Convergent sequences of dense graphs. I. Subgraph frequencies, metric properties and testing. Adv. Math. 219(6), 1801–1851 (2008)
Borgs, C., Chayes, J.T., Lovász, L., Sós, V.T., Vesztergombi, K.: Convergent sequences of dense graphs II. Multiway cuts and statistical physics. Ann. Math. (2) 176(1), 151–219 (2012)
Borgs, C., Chayes, J., Smith, A.: Private graphon estimation for sparse graphs. In: Advances in Neural Information Processing Systems, pp. 1369–1377 (2015)
Borgs, C., Chayes, J.T., Cohn, H., Zhao, Y.: An LP theory of sparse graph convergence I: limits, sparse random graph models, and power law distributions. (2014). arXiv preprint arXiv:1401.2906
Borgs, C., Chayes, J.T., Cohn, H., Zhao, Y.: An LP theory of sparse graph convergence. II: LD convergence, quotients, and right convergence. (2014). arXiv preprint arXiv:1408.0744
Boucheron, S., Lugosi, G., Massart, P.: Concentration inequalities using the entropy method. Ann. Probab. 31(3), 1583–1614 (2003)
Cai, D., Ackerman, N., Freer, C.: An iterative step-function estimator for graphons. (2014). arXiv preprint arXiv:1412.2129
Chan, S.H., Airoldi, E.M.: A consistent histogram estimator for exchangeable graph models. In: Proceedings of the 31st International Conference on Machine Learning, pp. 208–216 (2014)
Chatterjee, S.: Matrix estimation by universal singular value thresholding. Ann. Stat. 43(1), 177–214 (2015)
Choi, D.S., Wolfe, P.J., Airoldi, E.M.: Stochastic blockmodels with a growing number of classes. Biometrika 99(2), 273–284 (2012)
Diaconis, P., Janson, S.: Graph limits and exchangeable random graphs. (2007). arXiv preprint arXiv:0712.2749
Frieze, A., Kannan, R.: Quick approximation to matrices and applications. Combinatorica 19(2), 175–220 (1999)
Gao, C., Lu, Y., Zhou, H.H.: Rate-optimal graphon estimation. Ann. Stat. 43(6), 2624–2652 (2015)
Guédon, O., Vershynin, R.: Community detection in sparse networks via Grothendieck’s inequality. Probab. Theory Relat. Fields 165(3), 1025–1049 (2016)
Hoff, P.D., Raftery, A.E., Handcock, M.S.: Latent space approaches to social network analysis. J. Am. Stat. Assoc. 97(460), 1090–1098 (2002)
Holland, P.W., Laskey, K.B., Leinhardt, S.: Stochastic blockmodels: first steps. Soc. Netw. 5(2), 109–137 (1983)
Janson, S.: Graphons, Cut Norm and Distance, Couplings and Rearrangements. NYJM Monographs, vol. 4. State University of New York, University at Albany, Albany (2013)
Klopp, O.: Rank penalized estimators for high-dimensional matrices. Electron. J. Stat. 5, 1161–1183 (2011)
Klopp, O., Tsybakov, A.B., Verzelen, N.: Oracle inequalities for network models and sparse graphon estimation. Ann. Stat. 45(1), 316–354 (2017)
Latouche, P., Robin, S.: Bayesian Model Averaging of Stochastic Block Models to Estimate the Graphon Function and Motif Frequencies in a W-Graph Model. Technical report (2013)
Lovász, L.: Large Networks and Graph Limits, Volume 60 of American Mathematical Society Colloquium Publications. American Mathematical Society, Providence (2012)
Lovász, L., Szegedy, B.: Limits of dense graph sequences. J. Combin. Theory Ser. B 96(6), 933–957 (2006)
Szarek, S.J.: On the best constants in the Khinchin inequality. Stud. Math. 58(2), 197–208 (1976)
Szemerédi, E.: Regular Partitions of Graphs. Technical report, DTIC Document (1975)
Tsybakov, A.B.: Introduction to Nonparametric Estimation. Springer Series in Statistics. Springer, New York (2009). Revised and extended from the 2004 French original, Translated by Vladimir Zaiats
Wolfe, P.J., Olhede, S.C.: Nonparametric Graphon Estimation. (2013). arXiv preprint arXiv:1309.5936
Yang, J., Han, C., Airoldi, E.M.: Nonparametric estimation and testing of exchangeable graph models. In: AISTATS (2014)
Zhang, Y., Levina, E., Zhu, J.: Estimating Network Edge Probabilities by Neighborhood Smoothing. (2015). arXiv preprint arXiv:1509.08588
Appendices
A Proof methods
In this section, we summarize some basic facts and fundamental results that we use in the proofs.
A.1 Non-symmetric kernels
At some points, we will need to work with non-symmetric kernels and with kernels defined on general measurable subsets of \(\mathbb {R}\). In this section we define the corresponding spaces. Let \(\mathcal {X}\) and \(\mathcal {Y}\) denote two bounded measurable subsets of \(\mathbb {R}\). Then, \(\mathcal {W}_{\mathcal {X},\mathcal {Y}}\) refers to the collection of bounded measurable functions \(W{:}\,\mathcal {X}\times \mathcal {Y}\rightarrow [-\,1,1]\). We denote by \(\mathcal {W}^{+}_{\mathcal {X},\mathcal {Y}}\) the collection of bounded measurable non-negative functions \(W{:}\,\mathcal {X}\times \mathcal {Y}\rightarrow [0,1]\). Let \(\mathcal {W}_{\mathcal {X},\mathcal {Y}}[k]\) be the collection of \(k\)-step kernels, that is, the subset of kernels \(W\in \mathcal {W}_{\mathcal {X},\mathcal {Y}}\) such that for some \(\varvec{Q}\in \mathbb {R}^{k\times k}\) and some \(\phi _1{:}\,\mathcal {X}\rightarrow [k]\), \(\phi _2{:}\,\mathcal {Y}\rightarrow [k]\),
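The display referenced below as (32) did not survive extraction; by analogy with the paper's definition of step graphons, it is presumably

```latex
\[
W(x,y) \;=\; \varvec{Q}_{\phi_1(x),\,\phi_2(y)}
\qquad \text{for almost every } (x,y)\in\mathcal{X}\times\mathcal{Y}.
\]
```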
A kernel \(W\) is also said to be a \(q_1\times q_2\)-step function when it decomposes as in (32), but with \(\varvec{Q}\) a \(q_1\times q_2\) matrix, \(\phi _1\) mapping \(\mathcal {X}\) to \([q_1]\), and \(\phi _2\) mapping \(\mathcal {Y}\) to \([q_2]\). The cut norm readily extends to kernels \(W\in \mathcal {W}_{\mathcal {X},\mathcal {Y}}\) as follows:
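The defining display is missing here; following the cut norms for matrices and graphons used elsewhere in the paper (cf. [28]), it presumably reads

```latex
\[
\Vert W \Vert_{\square}
\;=\;
\sup_{S\subseteq\mathcal{X},\; T\subseteq\mathcal{Y}}
\left| \int_{S\times T} W(x,y)\,\mathrm{d}x\,\mathrm{d}y \right|,
\]
```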
where the supremum is taken over all measurable subsets \(S\subseteq \mathcal {X}\) and \(T\subseteq \mathcal {Y}\).
A.2 Concentration inequalities
In the proofs we repeatedly use Bernstein’s inequality. We state it here for the readers’ convenience. Let \(X_1,\dots ,X_N\) be independent zero-mean random variables. Suppose that \(|X_i|\le M\) almost surely, for all i. Then, for any \(t>0\),
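The statement itself (referenced later as (34)) is missing from this version; the standard form of Bernstein's inequality for bounded variables is

```latex
\[
\mathbb{P}\left( \left| \sum_{i=1}^{N} X_i \right| \ge t \right)
\;\le\;
2\exp\left( - \frac{t^2/2}{\sum_{i=1}^{N} \mathbb{E}[X_i^2] \;+\; Mt/3} \right).
\]
```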
We shall also rely on the bounded difference inequality (also called McDiarmid’s inequality).
Lemma 1
(Bounded difference inequality) Let \(X_1,\ldots , X_n\) denote n independent real random variables. Assume that \(g{:}\,\mathbb {R}^n \rightarrow \mathbb {R}\) is a measurable function satisfying, for some positive constants \((c_i)_{1\le i\le n }\), the bounded difference condition
for all \(x=(x_1,\ldots , x_i,\ldots , x_n)\in \mathbb {R}^n\), \(x'=(x_1,\ldots , x'_i,\ldots , x_n)\in \mathbb {R}^n\) and all \(i\in [n]\). Then, the random variable \(Z=g(X_1,\ldots , X_n)\) satisfies
for all \(t>0\).
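Both displays of Lemma 1 (the bounded difference condition and the resulting concentration bound) are missing here; the standard statement of McDiarmid's inequality is

```latex
\[
|g(x) - g(x')| \;\le\; c_i
\qquad\Longrightarrow\qquad
\mathbb{P}\big( |Z - \mathbb{E}[Z]| \ge t \big)
\;\le\; 2\exp\left( - \frac{2t^2}{\sum_{i=1}^{n} c_i^2} \right),
\]
```

where \(x\) and \(x'\) differ only in their \(i\)-th coordinate.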
A.3 Fano's lemma
In the sequel, \(\mathcal {KL}(\cdot ,\cdot )\) denotes the Kullback–Leibler divergence between two distributions. All the proofs of the minimax lower bounds in this manuscript rely on Fano's method. The following version of Fano's lemma is borrowed from [32]:
Lemma 2
[32, Theorem 2.7] Consider a parametric model \({\mathbb {P}}_{\theta }\), with \(\theta \in \Theta \) and a metric \(d(\cdot ,\cdot )\) on \(\Theta \). Assume that \(\Theta \) contains elements \(\theta _1, \ldots , \theta _M\), \(M\ge 3\), such that for all \(j,k\in [M]\) with \(j\ne k\)
(i) \(d(\theta _j,\theta _k)\ge s>0\),
(ii) \(\mathcal {KL}(\mathbb {P}_{\theta _j},{\mathbb {P}}_{\theta _k})\le \log (M)/32\).
Then, we have
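The conclusion display is missing; in the form of [32, Theorem 2.7], with the separation \(s\) above, it presumably reads

```latex
\[
\inf_{\hat\theta}\;\max_{j\in[M]}\;
\mathbb{P}_{\theta_j}\big( d(\hat\theta,\theta_j) \ge s/2 \big)
\;\ge\; C,
\]
```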
where \(C>0\) is a numerical constant.
A.4 Khintchine's inequality
Next, we state a particular case of Khintchine’s inequality that turns out to be useful for bounding the cut norm of step kernels in terms of their \(l_1\) norm:
Lemma 3
[30] Let \(\epsilon _1,\ldots ,\epsilon _p\) be i.i.d. Rademacher random variables and let \(x_1,\ldots , x_p\) be some real numbers. Then,
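The display of Lemma 3 (referenced later as (35)) is missing; the particular case of Khintchine's inequality with Szarek's optimal constant [30] is

```latex
\[
\mathbb{E}\left| \sum_{i=1}^{p} \epsilon_i x_i \right|
\;\ge\;
\frac{1}{\sqrt{2}} \left( \sum_{i=1}^{p} x_i^2 \right)^{1/2}.
\]
```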
We use this result to prove the following lower bound on the cut norm of step kernels:
Lemma 4
Let \(U{:}\,\mathcal {X}\times \mathcal {Y}\mapsto [-1,1]\) denote a measurable \(q_1\times q_2\)-step function. Then,
Proof of Lemma 4
There exist partitions \(\mathcal {X}= \mathcal {X}_1\cup \cdots \cup \mathcal {X}_{q_1}\) and \(\mathcal {Y}= \mathcal {Y}_1\cup \cdots \cup \mathcal {Y}_{q_2}\) such that the following holds: for any fixed \(y \in \mathcal {Y}\), U(x, y) is constant over \(x\in \mathcal {X}_i\) for all \(i\in [q_1]\) and, for any fixed \(x \in \mathcal {X}\), U(x, y) is constant over \(y\in \mathcal {Y}_i\) for all \(i\in [q_2]\). For any \(a\in [q_1]\) (resp. \(b\in [q_2]\)), let \(x_a\) (resp. \(y_b\)) denote an arbitrary element of \(\mathcal {X}_a\) (resp. \(\mathcal {Y}_b\)). By definition of \(\Vert U\Vert _{\square }\),
where we used in the last line that the value of the sum only depends on S and T through the quantities \(\lambda (S\cap \mathcal {X}_a)\) and \(\lambda (T\cap \mathcal {Y}_b)\). Since the maximum of a linear function on a convex set is achieved at an extremal point, it follows that
where we use (8) and take \(\epsilon _{a}={{\mathrm{sign}}}\sum _{ b\in [q_2]} \epsilon '_b \lambda (\mathcal {Y}_b)U[x_a,y_b]\). Let \(v=(v_1,\ldots , v_{q_2})\) denote i.i.d. Rademacher random variables and let \({\mathbb {E}}_{v}[\cdot ]\) denote the expectation with respect to v. Now, Khintchine's inequality (35) and the Cauchy–Schwarz inequality imply
\(\square \)
B Proof of Proposition 2
Since the diagonals of \(\varvec{A}\) and \(\varvec{\Theta }\) are both zero, it suffices to control the supremum over disjoint subsets S and T (see, e.g., [8])
Let S and T be any two disjoint subsets of [n]. Using Bernstein’s inequality (34) we have that
Now, using that the number of pairs of disjoint subsets (S, T) is at most \(3^n\), together with the union bound, we get that the probability that \(\big|\sum _{i\in S,j\in T}(\varvec{A}_{ij}-(\varvec{\Theta }_0)_{ij})\big|\) exceeds \(3\sqrt{\left( \Vert \varvec{\Theta }_0\Vert _{1}+n\right) n}\) for some pair (S, T) is bounded by \(2\exp (-n)\). Hence, we have
with probability at least \(1-2e^{-n}\). Bounding the distance by 1 on the exceptional event, we obtain the statement of Proposition 2.
C Proof of Proposition 3
Fix \(\rho _n\in (0,1)\). This proof is based on Fano’s method. To apply Fano’s Lemma (Lemma 2), it is enough to check that there exists a finite subset \(\Omega \) of \(\mathcal {T}[2,\rho _{n}]\) such that for any two distinct \(\varvec{\Theta },\varvec{\Theta }'\) in \(\Omega \) we have
(a) \(\Vert \varvec{\Theta }-\varvec{\Theta }'\Vert _{\square }\ge C\,\sqrt{\rho _{n}}\left( \frac{1}{\sqrt{n}}\wedge \sqrt{\rho _n}\right) \), and
(b) \(\mathcal {KL}(\mathbb {P}_{\varvec{\Theta }},\mathbb {P}_{\varvec{\Theta }'})\le \log (|\Omega |)/32\)
for some constant \(C>0\). Applying Lemma 2 to \(\Omega \) then leads to the desired result. It remains to prove the existence of \(\Omega \). As is classical for this kind of proof, we first build a collection \(\Omega '\subset \mathcal {T}[2,\rho _n]\) and then extract a maximal subset \(\Omega \subset \Omega '\) satisfying (a). Finally, we control the Kullback–Leibler divergence between any two of the corresponding probability distributions to show (b).
\(\underline{\hbox {Construction of }\Omega '}\). Fix \(\epsilon \in (0,\rho _n/4)\). For any \(u\in \{-1,1\}^n\), define \(\varvec{\Theta }_u\) by \((\varvec{\Theta }_u)_{i,j}= \rho _{n}/2+u(i)u(j)\epsilon \) where \(u=\left( u(1),\dots ,u(n)\right) \). In other words, the entries of \(\varvec{\Theta }_u\) are equal to \(\rho _n/2+\epsilon \) if \(u(i)u(j)=1\) and to \(\rho _n/2-\epsilon \) if \(u(i)u(j)=-1\). Obviously, the collection \(\Omega ':= \left\{ \varvec{\Theta }_u{:}\,u\in \{-1,1\}^n\right\} \) is included in \(\mathcal {T}[2,\rho _n]\).
Computation of the cut distances and extraction of a maximal subset. Given \(u\in \{-1,1\}^n\), denote by \(V_{u}:=\{i\in [n]{:}\,u(i)=1\}\) the set of indices corresponding to \(u(i)=1\) and by \(\bar{V}_u\) its complement. Then, given two vectors u and v, defining \(S:=V_{u}{\setminus } V_{v}\) and \(T := V_{v}\cap V_{u}\), we easily obtain
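The computation displayed here is missing. For this choice of S and T, every entry of \(\varvec{\Theta }_u-\varvec{\Theta }_v\) over \(S\times T\) equals \(2\epsilon \), so, assuming the paper's matrix cut norm carries the usual \(n^{2}\) normalization, the display presumably reads

```latex
\[
\Vert \varvec{\Theta}_u - \varvec{\Theta}_v \Vert_{\square}
\;\ge\;
\frac{1}{n^2}\left| \sum_{i\in S,\, j\in T}
\big( (\varvec{\Theta}_u)_{ij} - (\varvec{\Theta}_v)_{ij} \big) \right|
\;=\;
\frac{2\epsilon\, |V_u\setminus V_v|\,|V_u\cap V_v|}{n^2}.
\]
```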
By symmetry, we derive that
where \(A\triangle B\) denotes the symmetric difference of A and B. As a consequence, the cut distance between any two such matrices is large as long as the cardinality of the symmetric difference \(V_u\triangle V_v\) is bounded away both from zero and from n.
By the Varshamov–Gilbert combinatorial bound (see, e.g., [32, Lemma 2.9]), we can in fact pick \(u_1,\dots ,u_N\) satisfying
with \(N\ge \exp (c_1n)\) for some \(c_1>0\). In the sequel, we consider \(\Omega =\{\varvec{\Theta }_{u_i}{:}\,i=1,\dots ,N\}\). Hence, we have \(\log \vert \Omega \vert \ge c_1n\), whereas the previous inequalities ensure that
which proves (a) when one takes \(\epsilon \) as defined in (37) below.
Control of the Kullback–Leibler divergence. To prove (b), we use the definition of the Kullback–Leibler divergence \(\mathcal {KL}({\mathbb {P}}_{\varvec{\Theta }_{u}},{\mathbb {P}}_{\varvec{\Theta }_{v}})\) and the inequality \(\log x\le x-1\) for \(x>0\) to get
Now, \((\varvec{\Theta }_v)_{i,j}\ge \rho _n/4\) and \(\rho _n\le 1\) imply
Taking
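The definition (37) of \(\epsilon \) is missing from this version; to match the separation required in (a) while respecting \(\epsilon <\rho _n/4\), it is presumably of the form

```latex
\[
\epsilon \;=\; c_2\,\sqrt{\rho_n}\left( \frac{1}{\sqrt{n}} \wedge \sqrt{\rho_n} \right),
\]
```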
with a constant \(c_2>0\) small enough, we derive from the lower bound \(\log (|\Omega |)\ge c_1 n\) that
which proves (b).
D Proof of Proposition 4
Set \(\varvec{E}=\varvec{A}-\varvec{\Theta }_0\). We will need the following simple proposition (see Theorem 5 in [25]):
Proposition 9
If \(\lambda \ge \Vert \varvec{E}\Vert _{2\rightarrow 2}\), then
In view of Proposition 9, we need to bound \(\Vert \varvec{E}\Vert _{2\rightarrow 2}\) with high probability in order to choose the regularization parameter \(\lambda \). Let \(\varvec{E}^{*}=(\varvec{E}^{*}_{ij})\) be such that \(\varvec{E}^{*}_{ij}=\varvec{E}_{ij}\) for \(i<j\) and \(\varvec{E}^{*}_{ij}=0\) for \(i\ge j\). Then \(\Vert \varvec{E}\Vert _{2\rightarrow 2}\le 2\Vert \varvec{E}^{*} \Vert _{2\rightarrow 2}\). We can upper bound \(\Vert \varvec{E}^{*} \Vert _{2\rightarrow 2}\) using the following bound on the spectral norm of random matrices from [3]:
Proposition 10
Let \(\varvec{W}\) be the \(n\times m\) rectangular matrix whose entries \(\varvec{W}_{ij}\) are independent centered random variables bounded (in absolute value) by some \(\sigma _*>0\). Then, for any \(0<\epsilon \le 1/2\) there exists a universal constant \(c_{\epsilon }\) such that, for every \(t\ge 0\)
where we have defined
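Both the deviation bound and the definitions of \(\sigma _1,\sigma _2\) are missing here; in the form of [3] (cf. Corollary 3.11 there), they presumably read

```latex
\[
\mathbb{P}\Big( \Vert \varvec{W} \Vert_{2\rightarrow 2}
\;\ge\; (1+\epsilon)(\sigma_1 + \sigma_2) + t \Big)
\;\le\; (n\wedge m)\, e^{-t^2/(c_\epsilon \sigma_*^2)},
\]
\[
\sigma_1 = \max_i \Big( \textstyle\sum_j \mathbb{E}[\varvec{W}_{ij}^2] \Big)^{1/2},
\qquad
\sigma_2 = \max_j \Big( \textstyle\sum_i \mathbb{E}[\varvec{W}_{ij}^2] \Big)^{1/2}.
\]
```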
For \(\varvec{E}^{*}\), we have \(\sigma _1\le \sqrt{\rho _nn}\), \(\sigma _2\le \sqrt{\rho _nn}\), and \(\sigma _{*}\le 1\). Taking \(\epsilon =1/2\) and \(t=\sqrt{2c_{\epsilon }\log (n)}\) in Proposition 10, we obtain that there exists an absolute constant \(c^{*}\) such that
with probability at least \(1-1/n\). Since \(\rho _n\ge \log (n)/n\), we can take \(\lambda =c\sqrt{\rho _nn}\) where \(c\ge 12\sqrt{2}+4c^*\) so that \(\left\| \varvec{E}\right\| _{2\rightarrow 2}\le \lambda /2\). Then, Proposition 9 implies
It is easy to see that the cut-norm of a matrix can be bounded by its spectral norm:
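The display is missing here. Since \(|\sum _{i\in S,j\in T} \varvec{B}_{ij}| = |\mathbf{1}_S^\top \varvec{B}\,\mathbf{1}_T| \le n\,\Vert \varvec{B}\Vert _{2\rightarrow 2}\) for any \(S,T\subset [n]\), with the \(n^{2}\)-normalized matrix cut norm assumed here it presumably reads

```latex
\[
\Vert \varvec{B} \Vert_{\square} \;\le\; \frac{1}{n}\,\Vert \varvec{B} \Vert_{2\rightarrow 2}.
\]
```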
The bound (15) on the cut norm then follows from
In order to prove the Frobenius bound (14), we use the argument from [25]: we can equivalently write the singular value hard thresholding estimator as the solution to the following optimization problem:
which implies that, with probability larger than \(1-1/n\),
where we used in the last line that \(\Vert \varvec{E}\Vert _{2\rightarrow 2}\le \lambda /2\). Since \({{\mathrm{rank}}}(\varvec{\Theta }_0)\le k\), we have proved (14).
E Proof of Theorem 1
Note that both \(f_0=\rho _n W_0\) and \(\tilde{f}_{\varvec{\Theta }_0}\) are proportional to \(\rho _n\), so we may assume without loss of generality that \(\rho _n=1\). For \(k\ge n/2\), the result is a straightforward consequence of the second sampling lemma for graphons of [28], stated in Proposition 1. Given any graphon \(W_0\in \mathcal {W}^+[k]\), one can always divide some of the steps into smaller steps in such a way that \(W_0\) is a 2k-step graphon whose step weights are all at most \(1/k\). Thus, we only need to prove the result for graphons \(W_0\in \mathcal {W}^+[k]\) with \(32\le k\le n\) whose weights are all at most \(2/k\).
Let \({\varvec{\Theta }}_0'\) be the matrix with entries \(({\varvec{\Theta }}_0')_{ij}=W_0(\xi _i,\xi _j)\) for all i, j. As opposed to \(\varvec{\Theta }_0\), the diagonal entries of \({\varvec{\Theta }}_0'\) are not constrained to be zero. By the triangle inequality, we have
As the entries of \(\varvec{\Theta }_0\) coincide with those of \({\varvec{\Theta }}_0'\) outside the diagonal, the difference \(\widetilde{f}_{\varvec{\Theta }_0}- \widetilde{f}_{{\varvec{\Theta }}_0'}\) vanishes outside a set of measure \(1/n\). Since \(\Vert W_0\Vert _{\infty }\le 1\), \({\mathbb {E}}[\delta _{\square }(\widetilde{f}_{\varvec{\Theta }_0}, \widetilde{f}_{{\varvec{\Theta }}_0'})]\le 1/n\). Thus, we only need to prove that
We first need to build two suitable representations of \(W_0\) and \(\widetilde{f}_{{\varvec{\Theta }}_0'}\) in the quotient space \(\widetilde{\mathcal {W}}^+\).
As a first idea, one may want to define a representation \(\widehat{W}\) of \(\widetilde{f}_{{\varvec{\Theta }}_0'}\) that matches \(W_0\) on the largest possible (with respect to the Lebesgue measure) Borel set. In fact, one can match the two representations everywhere except on a Borel set of measure of order \(\sqrt{k/n}\). This, however, leads to a suboptimal bound of order \(\sqrt{k/n}\). In order to recover the correct logarithmic term, we refine the argument by showing that, for a suitable representation, the difference \(\widehat{W}-W_0\), when non-zero, is well approximated in the cut distance by a \(\lfloor \sqrt{k}\rfloor \)-step function which is zero except on a Borel set of measure much smaller than \(\sqrt{k/(n\log n)}\). To prepare the proof, we carefully build the representations of \(W_0\) and \(\widetilde{f}_{{\varvec{\Theta }}_0'}\).
Step 1: Construction of a suitable representation \(W\) of \(W_0\) in \(\widetilde{\mathcal {W}}^+\)
In the sequel, we denote \(q_1:=\lfloor \sqrt{k}\rfloor \). Here, we want to choose W in such a way that a distortion of W is well approximated in the cut norm by a \(q_1\)-step kernel. We use the following lemma, which is based on a variation of Szemerédi's regularity lemma. Let \(\varvec{Q}_0\in \mathbb {R}^{k\times k}_{\text {sym}}\) and \(\phi _0{:}\,[0,1]\rightarrow [k]\) be associated to \(W_0\) as in definition (18).
Lemma 5
There exist a permutation \(\pi \) of [k] and a partition \(\mathcal {P}=(P_1,\ldots , P_{q_1})\) of [k] made of successive intervals such that the following holds. Let \(\mathbf{Q}\) be the matrix obtained from \(\mathbf{Q}_0\) by jointly applying the permutation \(\pi \) to its rows and its columns. Denote by \(\phi = \pi \circ \phi _0\), and for \(a=1,\ldots , k\), \(\lambda _a:= \lambda (\phi ^{-1}(a))\). There are two matrices \(\mathbf{Q}^{(ap)}\) and \(\mathbf{Q}^{(ap,+)}\in [0,1]^{k\times k}\) that are \(q_1\)-block-constant according to the partition \(\mathcal {P}\) and that satisfy
According to Lemma 5, there exist two \(q_1\)-block-constant matrices \(\mathbf{Q}^{(ap)}\) and \(\mathbf{Q}^{(ap,+)}\) that approximate \(\mathbf{Q}\) well with respect to weighted cut norms. In (41), the weights are respectively \(\lambda _b\) and \(\sqrt{\lambda _a}\), whereas in (42) the weights are \(\sqrt{\lambda _a}\) and \(\sqrt{\lambda _b}\). Informally, these weights arise for the following reason: writing \(\widehat{\lambda }_a\) for the empirical weight of group a in \(\widehat{W}\) (see Step 2 for the definition), we have \(\widehat{\lambda }_a-\lambda _a = O_P(\sqrt{\lambda _a/n})\).
Invoking Lemma 5, we consider the graphons
Obviously, W is weakly isomorphic to \(W_0\).
Step 2: Construction of a suitable representation \(\widehat{W}\) of \(\widetilde{f}_{{\varvec{\Theta }}_0'}\) in the quotient space \(\widetilde{\mathcal {W}}^+\).
Recall that \(\xi _1,\ldots ,\xi _n\) are the i.i.d. uniformly distributed random variables in the W-random graph model (1) and that \(\phi \) is defined in the previous step. For \(a=1,\ldots , k\), let
be the (unobserved) empirical frequency of the group a corresponding to a finer partition of [0, 1] given by \(\phi \). For \(l=1,\ldots , q_1\), let
be the (unobserved) empirical frequency of the group l corresponding to the coarser partition of [0, 1] induced by \(\mathcal {P}\circ \phi \).
The relations \(\sum _{a=1}^k \lambda _a=\sum _{a=1}^k {\hat{\lambda }}_a=1\) imply
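The implied display (44) is missing here. Since both families of weights sum to one, the positive and negative parts of the discrepancies must balance, so it presumably reads

```latex
\[
\sum_{a\,:\,\lambda_a>\widehat{\lambda}_a} \big(\lambda_a - \widehat{\lambda}_a\big)
= \sum_{a\,:\,\widehat{\lambda}_a>\lambda_a} \big(\widehat{\lambda}_a - \lambda_a\big),
\qquad
\sum_{l\,:\,\omega_l>\widehat{\omega}_l} \big(\omega_l - \widehat{\omega}_l\big)
= \sum_{l\,:\,\widehat{\omega}_l>\omega_l} \big(\widehat{\omega}_l - \omega_l\big).
\]
```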
Consider a function \(\psi {:}\,[0,1]\rightarrow [k]\) such that:
(i) for all \(a\in [k]\), \(\lambda (\{x{:}\ \psi (x)= \phi (x)= a\})=\widehat{\lambda }_a \wedge \lambda _a \);
(ii) for all \(l\in [q_1]\), \(\lambda \big( \{x{:}\ \psi (x) \in P_l \text { and }\phi (x)\in P_l\} \big)= \omega _l\wedge \widehat{\omega }_l\);
(iii) for all \(a\in [k]\), \(\lambda (\psi ^{-1}(a))= \widehat{\lambda }_a\).
Such a function \(\psi \) exists. To see this, we first construct \(\psi \) so as to satisfy (i) and (iii):
- For each a such that \(\lambda _a>\widehat{\lambda }_a\), conditions (i) and (iii) are trivially satisfied if we take \(\psi ^{-1}(a)\) to be any subset of \(\phi ^{-1}(a)\) of Lebesgue measure \(\widehat{\lambda }_a\). Then, a subset of \(\phi ^{-1}(a)\) of Lebesgue measure \(\lambda _a-\widehat{\lambda }_a\) is left non-assigned. Summing over all such a, we see that a union of subsets of total Lebesgue measure \(m_+:=\sum _{a{:}\,\lambda _a>\widehat{\lambda }_a} (\lambda _a-\widehat{\lambda }_a)\) is left non-assigned.
- For a such that \(\lambda _a<\widehat{\lambda }_a\), we must have \(\psi (x)=a\) for \(x\in \phi ^{-1}(a)\) to satisfy (i). On the other hand, to meet condition (iii) we additionally need to assign \(\psi (x)=a\) for x on a set of Lebesgue measure \({\hat{\lambda }}_a-\lambda _a\). Summing over all such a, we need a set of Lebesgue measure \(m_-:=\sum _{a{:}\, \widehat{\lambda }_a> \lambda _a} (\widehat{\lambda }_a-\lambda _a)\) on which to make these additional assignments. But this set is readily available as the union of the non-assigned subsets for all a such that \(\lambda _a>\widehat{\lambda }_a\), since \(m_+=m_-\) by virtue of (44).
Now, to ensure that condition (ii) is satisfied, we assign \(\psi (x)\) as a priority to values belonging to the same partition element as \(\phi (x)\). Again, (44) ensures that this is possible.
Finally, define the graphons \(\widehat{W}(x,y)= \varvec{Q}_{\psi (x),\psi (y)}\), \(\widehat{W}_1(x,y)= \varvec{Q}^{(ap)}_{\psi (x),\psi (y)}\), and \(\widehat{W}_1^+(x,y)= \varvec{Q}^{(ap,+)}_{\psi (x),\psi (y)}\), where \(\varvec{Q}\), \(\varvec{Q}^{(ap)}\), and \(\varvec{Q}^{(ap,+)}\) are as in (43). Notice that, in view of (iii), \(\widehat{W}\) is weakly isomorphic to the empirical graphon \(\widetilde{f}_{\varvec{\Theta }'_0}\). Let \(\mathcal {R}= \{x{:}\ \phi (x)\ne \psi (x)\}\). Since W and \(\widehat{W}\) coincide on \(\mathcal {R}^c\times \mathcal {R}^c\), the purpose of (i) is to minimize the Lebesgue measure of the support of \(W-\widehat{W}\). With properties (i) and (iii) alone, one could prove that \({\mathbb {E}}[\Vert W-\widehat{W}\Vert _{\square }]\le C \sqrt{k/n}\), as the Lebesgue measure of this support is at most of order \(\sqrt{k/n}\). We will improve this rate by a logarithmic term, since (ii) enforces that the cut norm of \(W-\widehat{W}\) is much smaller than its Lebesgue measure.
Step 3: Control of the cut norm. Since \(\delta _{\square }(\cdot ,\cdot )\) is a metric on the quotient space \(\widetilde{\mathcal {W}}^+\),
By definition of \(\psi \), the two functions W(x, y) and \(\widehat{W}(x,y)\) are equal except possibly when either x or y belongs to \(\mathcal {R}\). As a consequence of the triangle inequality and of the symmetry of \(W-\widehat{W}\), we get
First, we focus on \(\mathbb {E}[\Vert (W-\widehat{W})|_{\mathcal {R}\times \mathcal {R}^c}\Vert _{\square }]\), the second term being handled similarly at the end of the proof. For a and b in [k], we write \(a\sim _{P} b\) (resp. \(a \not \sim _{P} b\)) when a and b belong (resp. do not belong) to the same element of the partition P. Define
Obviously, we have \(\mathcal {R}_2\subset \mathcal {R}\). Property (ii) of \(\psi \) implies that \(\lambda (\mathcal {R}_2)=\sum _{a=1}^{q_1}(\omega _a-\widehat{\omega }_a)_+\). We rely on the decompositions \(W= W_1 + (W-W_1)\) and \(\widehat{W}= \widehat{W}_1 + (\widehat{W} - \widehat{W}_1)\). For any \(x\in \mathcal {R}{\setminus } \mathcal {R}_2\), we have by definition (43) of \(W_1\) that \((W_1-\widehat{W}_1)(x,y)=0\). Together with the triangle inequality, this yields
To control the first expression on the right-hand side, we simply bound the cut norm of the difference by its \(l_1\) norm
since \(W_1\) and \(\widehat{W}_1\) take values in [0, 1]. Then, relying on the fact that \(n\widehat{\omega }_a\) is distributed as a binomial random variable with parameters \((n,\omega _a)\), together with the Cauchy–Schwarz inequality, we get \(\mathbb {E}\left| \omega _a -\widehat{\omega }_a\right| \le \sqrt{\frac{\omega _a(1-\omega _a)}{n}}\) and
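The display here is missing; summing the previous bound over \(a\in [q_1]\) and applying the Cauchy–Schwarz inequality once more presumably gives

```latex
\[
\mathbb{E}\Big[ \sum_{a=1}^{q_1} \big| \omega_a - \widehat{\omega}_a \big| \Big]
\;\le\; \sum_{a=1}^{q_1} \sqrt{\frac{\omega_a}{n}}
\;\le\; \sqrt{\frac{q_1}{n}}\,,
\]
```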
where we used the Cauchy–Schwarz inequality again in the last line. Let us now turn to the second and third expressions in (46). To this end, we introduce a new kernel U. For \(a=1,\ldots , k\), define \(\widehat{\lambda }^{\delta }_a=|\lambda _a-\widehat{\lambda }_a|\) and the functions \(F_{\widehat{\lambda }^{\delta }}{:}\,[k]\rightarrow \left[ 0,\sum _a|\lambda _a-\widehat{\lambda }_a|\right] \) and \(F_{\phi }{:}\,[k]\rightarrow [0,1]\) by
For any \(a,b\in [k]\), set \(\widehat{\Pi }_{a,b}= [F_{\widehat{\lambda }^{\delta }}(a-1),F_{\widehat{\lambda }^{\delta }}(a) )\times [F_{\phi }(b-1), F_{\phi }(b) )\) and let U be a \(k\times k\) step kernel on \([0, \sum _a|\widehat{\lambda }_a-\lambda _a|]\times [0,1]\) defined by
By definition of \(\mathcal {R}\) and of the function \(\psi \), we have, for any \(a\in [k]\), \(\lambda (\phi ^{-1}(a)\cap \mathcal {R})= (\lambda _a-\widehat{\lambda }_a)_+\) and \(\lambda (\psi ^{-1}(a)\cap \mathcal {R}^c)= \lambda _a\wedge \widehat{\lambda }_a\). As a consequence, the restriction of \((W-W_1)\) to \(\mathcal {R}\times \mathcal {R}^c\) is, up to a measure preserving bijection of its rows and of its columns, equal to the restriction of U to the set \(\big(\cup _{a{:}\,\lambda _a>\widehat{\lambda }_a}[F_{\widehat{\lambda }^{\delta }}(a-1),F_{\widehat{\lambda }^{\delta }}(a))\big)\times \big(\cup _{a} [F_{\phi }(a-1),F_{\phi }(a-1)+\widehat{\lambda }_a\wedge \lambda _a)\big) \). This entails that
On the other hand, for any \((x,y)\in \mathcal {R}\times \mathcal {R}^c\),
by the definition of \(\mathcal {R}\). In view of the definition of \(\psi \), for any \(a\in [k]\) we have \(\lambda (\psi ^{-1}(a)\cap \mathcal {R})= (\widehat{\lambda }_a- \lambda _a)_+\). As a consequence, the restriction of \((\widehat{W}-\widehat{W}_1)\) to \(\mathcal {R}\times \mathcal {R}^c\) is, up to a measure preserving bijection of its rows and of its columns, equal to the restriction of U to the set \(\big(\cup _{a{:}\,\lambda _a<\widehat{\lambda }_a}[F_{\widehat{\lambda }^{\delta }}(a-1),F_{\widehat{\lambda }^{\delta }}(a))\big)\times \big(\cup _{a} [F_{\phi }(a-1),F_{\phi }(a-1)+\widehat{\lambda }_a\wedge \lambda _a)\big) \). This implies that \( \Vert (\widehat{W}-\widehat{W}_1)|_{\mathcal {R}\times \mathcal {R}^c} \Vert _{\square }\le \Vert U \Vert _{\square }\). Thus, it remains to control \(\mathbb {E}[\Vert U\Vert _{\square }]\).
Step 4: Control of \(\mathbb {E}[\Vert U\Vert _{\square }]\). Define the sets \(\mathcal {B}_1:= \prod _{a=1}^{k}[0,|\widehat{\lambda }_a-\lambda _a |]\) and \(\mathcal {B}_2:= \prod _{a=1}^{k}\left[ 0,\lambda _a\right] \). Then, the cut norm of U can be written as
since the supremum of a linear function over a convex set is achieved at an extremal point. The random variable \(|\widehat{\lambda }_a-\lambda _a|\) is, in expectation, of order \(\sqrt{\lambda _a/n}\). If we could replace each \(|\widehat{\lambda }_a-\lambda _a|\) by \(\sqrt{\lambda _a/n}\) in (50), then, thanks to (41), we could prove that \(\Vert U\Vert _{\square }\) is (up to a multiplicative constant) at most \(\sqrt{k/(n\log (k))}\). Unfortunately, applying Bernstein's inequality or the bounded difference inequality directly, either to control \(|\widehat{\lambda }_a-\lambda _a|\) simultaneously over all \(a\in [k]\), or to control \(\sum _{a\in S,b\in T} \lambda _b |\widehat{\lambda }_a-\lambda _a| (\mathbf{Q}_{ab}- \mathbf{Q}^{(ap)}_{ab})\) simultaneously over all \(S,T\subset [k]\), would lose at least a logarithmic factor.
To bypass this issue, we adapt Lemma 10.9 of [28], which is a key step in the proof of the sampling lemma for graphons (Lemma 10.5 in [28]). Given a bounded non-symmetric kernel \(W\in \mathcal {W}_{\mathcal {X},\mathcal {Y}}\), we define the following one-sided version of the cut norm:
where the supremum is taken without absolute values. Consequently, the cut norm \(\Vert W\Vert _{\square }\) is the maximum of \(\Vert W\Vert ^{+}_{\square }\) and \(\Vert -W\Vert ^{+}_{\square }\).
Lemma 6
Let \(W\in \mathcal {W}_{[0,u],[0,v]}[k]\) and let \(\varvec{Q}\in \mathbb {R}^{k\times k}\), \(\phi _1{:}\,[0,u]\rightarrow [k]\) and \(\phi _2{:}\,[0,v]\rightarrow [k]\) be associated to W as in (32). For \(a=1,\ldots , k\), define \(\alpha _a:= \lambda (\phi _1^{-1}(\{a\}))\) and \(\beta _a:= \lambda (\phi _2^{-1}(\{a\}))\). Given any subset \(R\subset [k]\), let
Finally, we define for any \(S,T\subset [k]\), \(W[S,T]:= \sum _{a\in S,b\in T} \alpha _a\beta _b \varvec{Q}_{ab}.\) Then, for any integer q with \(1\le q\le k\), we have
Note that, in contrast to Eq. (50), where one considers a supremum over \(2^{2k}\) sums, only \(k^{2q}\) terms are involved in (52), at the price of an additive term of order \(q^{-1/2}\). The difficulty is that, when we apply this lemma to U, these \(k^{2q}\) terms will turn out to be random.
In the sequel, we fix \(q= \lfloor \sqrt{k}\rfloor \) and apply Lemma 6 to U. We can take \(u=v=1\). Since \(\sum _{a=1}^k\lambda _a=1\) and since we assumed at the beginning of the proof that the weights \(\lambda _a\) are all at most \(2/k\), it follows that \((k\sum _{a=1}^k\lambda _a^2)^{1/2}\le \sqrt{2}\). Let M and N denote the random variables \(M:=\sum _{a=1}^k |\widehat{\lambda }_a-\lambda _a|\) and \(N:=\left( \sum _{a=1}^k k|\widehat{\lambda }_a-\lambda _a|^2\right) ^{1/2}\). Both M and N are functions of the independent random variables \(\xi _1,\ldots , \xi _n\). Besides, if we change the value of one of the \(\xi _i\), the value of M changes by at most \(2/n\) and the value of N changes by at most \(\sqrt{2k}/n\). As a consequence, we may apply the bounded difference inequality (Lemma 1) to these two random variables. Then, with probability larger than \(1- 2\exp (-\sqrt{k}/\log (k))\), one has
In (53), (54) we bound the expectation using the Cauchy–Schwarz inequality together with the fact that, since \(\xi _1,\ldots ,\xi _n\) are i.i.d. uniformly distributed random variables, \(n\widehat{\lambda }_a\) has a binomial distribution with parameters \((n, \lambda _a)\):
Bound (54) and the inequality \((k\sum _{a=1}^k\lambda _a^2)^{1/2}\le \sqrt{2}\) imply that, for U, with probability larger than \(1- 2\exp (-\sqrt{k}/\log (k))\),
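As a numerical sanity check of the moment bound behind (53) and (54), namely \({\mathbb E}|\widehat{\lambda }_a-\lambda _a|\le \sqrt{\lambda _a(1-\lambda _a)/n}\) by the Cauchy–Schwarz inequality and the binomial variance, here is a small simulation sketch (the parameters \(n=200\) and \(\lambda _a=0.05\) are illustrative only):

```python
import math
import random

def mean_abs_dev(n, lam, trials=5000, seed=0):
    """Monte-Carlo estimate of E|lambda_hat - lambda| when
    n * lambda_hat follows a Binomial(n, lambda) distribution."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        hits = sum(1 for _ in range(n) if rng.random() < lam)
        total += abs(hits / n - lam)
    return total / trials

n, lam = 200, 0.05
empirical = mean_abs_dev(n, lam)
# Cauchy-Schwarz: E|X - EX| <= sqrt(Var X) = sqrt(lam * (1 - lam) / n)
cs_bound = math.sqrt(lam * (1 - lam) / n)
print(empirical <= cs_bound)
```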
Fix any two subsets \(R_1, R_2\subset [k]\) of size less than or equal to q. In view of (52), one needs to control the following random variable
This is done in the following lemma:
Lemma 7
Let \(R_1, R_2\) be two subsets of [k] of size less than or equal to q and \(Z_{R_1,R_2}\) given by (56). Then, we have that with probability larger than \(1 - (1+2k)\exp (-\sqrt{k}/\log (k))\),
Now, it follows from Lemma 6 together with (55) and Lemma 7 that, with probability larger than \(1 - (3+2k)\exp (-\sqrt{k}/\log (k))\),
Controlling analogously \(\Vert -U\Vert ^{+}_{\square }\), we conclude that there exists an event \(\mathcal {A}\) of probability larger than \(1- 10\exp (-\sqrt{k}/\log (k))\) such that, on \(\mathcal {A}\),
To finish the control of \({\mathbb {E}}[\Vert U\Vert _{\square }]\), we use the rough bound \(\Vert U\Vert _{\square }\le \Vert U\Vert _1\le \sum _{a=1}^k |\widehat{\lambda }_a-\lambda _a|\) on the complementary event \(\bar{\mathcal {A}}\).
where we use (53). Now, using the decomposition (46), (47) and (49), we can conclude that
The following lemma gives a corresponding bound on the second term \(\Vert (W-\widehat{W})|_{\mathcal {R}\times \mathcal {R}}\Vert _{\square }\) in (45). The proof is somewhat analogous to that of the control of \(\Vert (W-\widehat{W})|_{\mathcal {R}\times \mathcal {R}^c}\Vert _{\square }\) and is postponed to the end of the section.
Lemma 8
We have
In view of (45), we have proved Theorem 1. \(\square \)
Proof of Lemma 5
For \(a\in [k]\), we denote \((\lambda _0)_a= \lambda (\phi _0^{-1}(a))\) and \(u_a= \frac{\sqrt{(\lambda _0)_a}}{\sum _b \sqrt{(\lambda _0)_b}}\). For any \(b\in [k]\), define the cumulative distribution functions \(F_{0}(b)=\sum _{a=1}^{b} (\lambda _0)_a\) and \(F_1(b)= \sum _{a=1}^{b} u_a\). For \(a,b\in [k]\), let \((\Pi _d)_{ab}=[F_{0}(a-1),F_{0}(a))\times [F_{1}(b-1),F_{1}(b))\) and \((\Pi ^+_d)_{ab}=[F_{1}(a-1),F_{1}(a))\times [F_{1}(b-1),F_{1}(b))\). In order to construct a suitable \(q_1\)-step kernel, we consider first the (not necessarily symmetric) kernels \(W_d\) and \(W^+_d\) defined by
In comparison to \(W_0\), the lengths of the steps in \(W_d\) and \(W_d^+\) have been modified.
Lemma 9
Let \(W\in \mathcal {W}_{[0,1],[0,1]}\) be a k-step kernel defined by
where \(\varvec{Q}\in [0,1]^{k\times k}\) and \( (S_{1},\dots , S_{k})\) and \((T_{1},\dots , T_{k})\) are two partitions of [0, 1] into a finite number of measurable sets. For any integer \(q_0\ge 2\), there exists a \(q_0\)-step kernel \(W^{(ap)}\in \mathcal {W}^{+}_{[0,1],[0,1]}\) satisfying
-
(i)
for any \(a,b\in [k]\), \(W^{(ap)}\) is constant on \(S_{a}\times T_{b}\) and
-
(ii)
\(\left\| W- W^{(ap)}\right\| _{\square }\le \frac{ C}{\sqrt{\log (q_0)}}\).
The second property, (ii), is just a consequence of the weak regularity lemma for kernels [19] (see also Corollary 9.13 in [28]). The first property, (i), follows from the explicit construction of the approximating kernel by Frieze and Kannan (see the proof of Lemma 9.10 in [28]). For the sake of completeness, we give the details at the end of this section.
Fix \(q_0=\lfloor k^{1/4}\rfloor \). Note that \(q_0\ge 2\) since we assume that \(k\ge 16\). We denote by \(W_{d}^{(ap)}\) and \(W^{(ap,+)}_{d}\) the \(q_{0}\)-step kernels given by Lemma 9 to approximate \(W_{d}\) and \(W^{(+)}_{d}\) respectively. By virtue of Property (i), there exist two matrices \(\varvec{Q}^{(ap)}_{0}\) and \(\varvec{Q}_{0}^{(ap,+)}\) in \([0,1]^{k\times k}\) such that
There exist two partitions \(\mathcal {P}_d\) and \(\mathcal {P}_d^+\) of [k] such that \(\varvec{Q}^{(ap)}_{0}\) is block constant according to \(\mathcal {P}_d\) and \(\varvec{Q}^{(ap,+)}_{0}\) is block constant according to \(\mathcal {P}^+_d\). Let \(\mathcal {P}^*\) be the coarsest partition that refines both \(\mathcal {P}_d\) and \(\mathcal {P}_d^+\). As a consequence, \(\mathcal {P}^*\) is made of at most \(q_0^2\le q_1\) subsets. By possibly refining \(\mathcal {P}^*\), we may assume without loss of generality that \(\mathcal {P}^*=(P^*_1,\ldots , P^*_{q_1})\) is made of exactly \(q_1\) elements. Let \(\pi \) be a permutation of [k] transforming \(\mathcal {P}^*\) into a partition \(\mathcal {P}=(P_1,\ldots ,P_{q_1})\) with \(P_a=\{\pi (b), b\in P^*_a\}\) made of consecutive intervals. Denoting by \(\varvec{\Pi }\) the corresponding permutation matrix, we finally take
Now we are ready to prove (41) and (42). Recall that we denote \(\phi =\pi \circ \phi _0 \) and \(\lambda _a:= \lambda (\phi ^{-1}(a))\) for \(a\in [k]\). Define the sets \(\mathcal {B}_1:= \prod _{a=1}^{k}[0,u_{\pi (a)}]\) and \(\mathcal {B}_2:= \prod _{a=1}^{k}[0,\lambda _a]\). Since \(W_d-W_{d}^{(ap)}\) is a k-step function, its cut norm can be written as
since the supremum is achieved at an extremal point of the convex set and, in the last inequality, we use property (ii) of Lemma 9. Now (58), (59) and the definition of \(u_{\pi (a)}\) imply
by the Cauchy–Schwarz inequality. We have proved (41). The second inequality (42) is derived similarly. \(\square \)
Proof of Lemma 9
We adapt the proof of the weak regularity lemma for symmetric kernels [28, Lemma 9.9] to non-symmetric ones. We use the following extension of Lemma 9.11(a) in [28].
Lemma 10
For every \(W \in \mathcal {W}_{[0,1],[0,1]}[k]\) such that
where \(\varvec{Q}\in \mathbb {R}^{k\times k}\) and \(\mathcal {P}=\left\{ \left( S_{1},\dots , S_{k}\right) , \left( T_{1},\dots , T_{k}\right) \right\} \) are two partitions of [0, 1] into a finite number of measurable sets, there are two sets \(\mathcal {A},\mathcal {B}\subset [k]\) and a real number \(0\le a\le \max _{a,b} |\varvec{Q}_{ab}|\) such that, for \(S'=\cup _{a\in \mathcal {A} } S_a\quad \text {and }\quad T'=\cup _{b\in \mathcal {B} }T_b\),
Now we apply Lemma 10 repeatedly to get pairs of sets \(S'_i,T'_i\) and real numbers \(a_i\) such that, setting \(W_j=W-\sum _{i=1}^{j}a_{i}\mathbb {1}_{S'_{i}\times T'_{i}}\) for any positive integer j, we have
Fix some integer \(k_0>0\). Since the right-hand side of the above equation remains non-negative, there exists \(0\le i< k_0\) with \(\Vert W_i \Vert ^{2}_{\square }\le 1/k_0\). Now putting \(a_{l}=0\) for \(l>i\) we get that for any \(W\in \mathcal {W}_{[0,1],[0,1]}[k]\) and any \(k_0\ge 1\) there are \(k_0\) pairs of subsets \(S'_{i},T'_{i}\subset [0,1]\) and \(k_0\) real numbers \(a_i\) such that
Note that the approximation \(W^{(ap)}= \sum _{i=1}^{k_0}a_{i}\mathbb {1}_{S'_{i}\times T'_{i}}\) is a step function with at most \(2^{k_0}\) steps and \(a_i\ge 0\) for all i. On the other hand, by construction, for any \(a,b\in [k]\), \(W^{(ap)}\) is constant on all sets of the form \(S_{a}\times T_{b}\). We conclude by taking \(k_0=\lfloor \log (q_0)/\log (2)\rfloor \). \(\square \)
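To spell out the pigeonhole step in the proof above, under the standard Frieze–Kannan norm decrement (our restatement, consistent with the argument): for every j,

```latex
\[
\Vert W_{j}\Vert _{2}^{2}
\;\le\;
\Vert W\Vert _{2}^{2}-\sum _{i=0}^{j-1}\Vert W_{i}\Vert _{\square }^{2}.
\]
```

If \(\Vert W_{i}\Vert _{\square }^{2}>1/k_0\) held for every \(0\le i<k_0\), the right-hand side would be negative at \(j=k_0\) since \(\Vert W\Vert _2\le 1\); hence some \(i<k_0\) satisfies \(\Vert W_{i}\Vert _{\square }^{2}\le 1/k_0\).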
Proof of Lemma 10
This lemma is proved in [28, Lemma 9.11] for symmetric kernels. For the reader's convenience we give the details here. Let W be a k-step kernel and let \(\left( S_{1},\dots , S_{k}\right) , \left( T_{1},\dots , T_{k}\right) \) be two measurable partitions of [0, 1] such that W is constant on each set \(S_i\times T_j\). Relying on a convexity argument as in the proof of Lemma 5, the cut norm is achieved for measurable sets S and T that are unions of \(S_i\) and \(T_j\) respectively, that is
where \(S= \cup _{a\in \mathcal {A}}S_a\) and \(T= \cup _{b\in \mathcal {B}}T_b\) with \(\mathcal {A}\), \(\mathcal {B}\subset [k]\). Let \({\mathbf {a}}=\frac{1}{\lambda (S)\lambda (T)}\Vert W\Vert _{\square }\). Then, we have
which completes the proof. \(\square \)
Proof of Lemma 6
This proof closely follows that of Lemma 10.9 in [28]. It is easy to see that
so we only need to bound these expressions. Let Q and \(Q'\) be two independent uniformly chosen q-subsets of [k] and let \({\mathbb {E}}_{Q}\) (resp. \({\mathbb {E}}_{Q'}\)) denote the expectation with respect to Q (resp. \(Q'\)). We shall prove that, for any \(S,T\subset [k]\),
By symmetry, this will imply
so that gathering both inequalities yields
Since the above expectation is less than or equal to \(\sup _{R_i,\ |R_i|\le q}W\left[ R_2^{r,W},R_1^{l,W}\right] \), this will conclude the proof. Thus, we only have to show (61). Note that \(W[S,T]\le W[T^{r,W}, T]\) implies that it suffices to prove
Let Z denote the above difference of expectations. For any \(a\in [k]\), write \(B_a= \sum _{b\in T}\beta _b \varvec{Q}_{ab}\) and \(A_a= \sum _{b\in T\cap Q}\beta _b \varvec{Q}_{ab}\). By the definition (51), \(B_a\) is non-negative for \(a\in T^{r,W}\) and \(B_a\le 0\) if \(a\not \in T^{r,W}\). In the same way, \(A_{a}>0\) for \(a\in (Q\cap T)^{r,W}\) and \(A_{a}\le 0\) for \(a\notin (Q\cap T)^{r,W}\). Denoting by \({\mathbb {P}}_Q\) the probability with respect to Q, we obtain
Now, using \({\mathbb {E}}_Q[A_a]=q B_a/k\), it follows from the Chebyshev inequality that, for \(a\in T^{r,W}\), we have \({\mathbb {P}}_Q[A_a<0]\le \mathrm{Var}_Q[A_a]/{\mathbb {E}}_Q^2[A_a]\). Since a probability is smaller than or equal to one, it follows that \({\mathbb {P}}_Q[A_a<0]\le \sqrt{\mathrm{Var}_Q[A_a]}/\vert {\mathbb {E}}_Q[A_a]\vert \). Similarly, for \(a\notin T^{r,W}\) we also have \({\mathbb {P}}_Q[A_a>0]\le \sqrt{\mathrm{Var}_Q[A_a]}/|{\mathbb {E}}_Q[A_a]|\). Coming back to Z, this yields
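The chain of bounds just described can be written in one line: for \(a\in T^{r,W}\),

```latex
\[
{\mathbb P}_Q[A_a<0]
\;\le\;
\min \Big (1,\ \frac{\mathrm{Var}_Q[A_a]}{{\mathbb E}_Q^{2}[A_a]}\Big )
\;\le\;
\frac{\sqrt{\mathrm{Var}_Q[A_a]}}{|{\mathbb E}_Q[A_a]|},
\]
```

where the last step uses \(\min (1,x^{2})\le x\) for all \(x\ge 0\); the same bound holds for \({\mathbb P}_Q[A_a>0]\) when \(a\notin T^{r,W}\).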
Working out the variance, we get \(\mathrm{Var}_Q[A_a]\le \tfrac{q}{k}\sum _{b\in T}\beta ^2_b \varvec{Q}^2_{ab}\le q (\sum _{b\in [k]}\beta ^2_b)/k\), which concludes the proof. \(\square \)
Proof of Lemma 7
Note that in the definition (56) of \(Z_{R_1,R_2}\), the set \(R_2^{r,U}\) is deterministic whereas the set \(R_1^{l,U}\) only depends on \((\widehat{\lambda }_a)_{a\in R_1}\). We can upper bound \(Z_{R_1,R_2}\) in the following way:
where we use \(\left| \sum _{b} \lambda _b (\varvec{Q}_{ab}-\varvec{Q}^{ad}_{ab})\right| \le 1\). We set
Conditionally on \((\widehat{\lambda }_a)_{a\in R_1}\), \(T_{R_1,R_2}\) is distributed as a function of \(n-n\sum _{a\in R_1}\widehat{\lambda }_a\) i.i.d. random variables \(\xi '_i\) such that \({\mathbb {P}}[\xi '=a]= \lambda _a/ (1-\sum _{a\in R_1}\lambda _a)\) for any \(a\in [k]{\setminus } R_1\). Besides, if we change the value of one of these \(\xi '_i\), the value of this expression changes by at most 2/n. It then follows from the bounded difference inequality (Lemma 1) that, for any \(t>0\),
Let us bound this conditional expectation:
Now, using Cauchy–Schwarz inequality, we have
where we used that \(\lambda _b\le 2/k\), \(|R_1|\le q \le k^{1/2}\) and \(k\ge 8\). The supremum in (66) is achieved for subsets \((S^*,T^*)\) such that, for all \(a\in S^*\), \(\sum _{b\in T^*}\lambda _b (\varvec{Q}_{ab}-\varvec{Q}^{ad}_{ab})\) is non-negative (otherwise this would contradict the optimality of \(S^*,T^*\)). As a consequence, we can plug the upper bounds on \({\mathbb {E}}\left[ |\widehat{\lambda }_a-\lambda _a| \,\big \vert \, (\widehat{\lambda }_c)_{c\in R_1}\right] \) into (66):
where we used the property (41) of \(\varvec{Q}^{ad}\). Coming back to (65) and integrating the deviation inequality with respect to \((\widehat{\lambda }_a)_{a\in R_1}\), we conclude that, for any \(t>0\)
Fixing \(t= 2\log (k) q + \sqrt{k}/\log (k)\) and taking a union bound over all possible \(R_1\), \(R_2\), we derive that
on an event of probability higher than \(1-\exp (-\sqrt{k}/\log (k))\).
Next we bound \(\max _{a=1, \ldots , k} |\widehat{\lambda }_a-\lambda _a|\). Recall that \(n\widehat{\lambda }_a\) has a binomial distribution with parameters \((n, \lambda _a)\) and that \(\lambda _{a}\le 2/k\). For any \(a\in [k]\), applying Bernstein's inequality to \(|\widehat{\lambda }_a-\lambda _a|\) we get
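In its standard form for \(n\widehat{\lambda }_a\sim \) Binomial\((n,\lambda _a)\), Bernstein's inequality gives (our restatement; the displayed constants may differ from the original):

```latex
\[
{\mathbb P}\Big [\,|\widehat{\lambda }_a-\lambda _a|> t/n\,\Big ]
\;\le\;
2\exp \Big (-\frac{t^{2}/2}{n\lambda _a(1-\lambda _a)+t/3}\Big ).
\]
```

With \(\lambda _a\le 2/k\) and \(t=C\sqrt{n/\log (k)}\), one checks that the exponent is at least of order \(\sqrt{k}/\log (k)\) in the regime \(k\le C_1 n\).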
Taking \(t=C \sqrt{n/\log (k)}\) (for a suitable constant \(C>0\)) and applying the union bound, we derive that with probability larger than \(1- 2k\exp (-\sqrt{k}/\log (k))\)
The bound (67) together with (68) imply the statement of Lemma 7. \(\square \)
Proof of Lemma 8
As the control of \((W-\widehat{W})|_{\mathcal {R}\times \mathcal {R}}\) is quite similar to that of \((W-\widehat{W})|_{\mathcal {R}\times \mathcal {R}^c}\), we only sketch the main steps. Relying on the graphon \(W_1^+\) [defined in (43)], we have the following decomposition:
Since \((W_1^+-\widehat{W}_1^+)(x,y)\) is zero except if \(x\in \mathcal {R}_2\) or \(y\in \mathcal {R}_2\), we bound the first expression by its \(l_1\) norm as for \(W_1-\widehat{W}_1\):
The last two expressions in (69) are bounded by the cut norm of a kernel V defined as follows. For any \(a,b\in [k]\), define \(\widetilde{\Pi }_{a,b}= [F_{\widehat{\lambda }^{\delta }}(a-1),F_{\widehat{\lambda }^{\delta }}(a))\times [F_{\widehat{\lambda }^{\delta }}(b-1), F_{\widehat{\lambda }^{\delta }}(b))\) where \(F_{\widehat{\lambda }^{\delta }}(\cdot )\) has been defined in (48). Let V be the \(k\times k\) step kernel on \(\left[ 0, \sum _a|\widehat{\lambda }_a-\lambda _a|\right] ^2\) given by
Now, as for the restrictions of \(W-W_1\) and \({\widehat{W}}-{\widehat{W}}_1\) to \(\mathcal {R}\times \mathcal {R}^c\), we have
Thus, it boils down to controlling \(\mathbb {E}\left[ \Vert V\Vert _{\square }\right] \). Since V is a k-step kernel, its cut norm can be written as
As for the kernel U in the main proof, we rely on Lemma 6. The random variables \(\sum _{a}|\widehat{\lambda }_a-\lambda _a|\) and \((\sum _{a}|\widehat{\lambda }_a-\lambda _a|^2)^{1/2}\) are controlled as in (53) and (54).
Fix any two subsets \(R_1, R_2\subset [k]\) of size less than or equal to q and define
The set \(R_1^{l,V}\) only depends on \((\widehat{\lambda }_a)_{a\in R_1}\) and \(R_2^{r,V}\) only depends on \((\widehat{\lambda }_a)_{a\in R_2}\). We have
since \(\sum _{a\in [k]}|\widehat{\lambda }_a -\lambda _a|\le 2\). We set
Write \(R:= R_1\cup R_2\) and \(\widehat{\lambda }_{\{R\}}:= (\widehat{\lambda }_a)_{a\in R}\). Conditionally on \(\widehat{\lambda }_{\{R\}}\), \(T_{R_1,R_2}\) is a function of \(n-n\sum _{a\in R}\widehat{\lambda }_a\) independent random variables. Besides, if we change the value of one of these independent random variables, the value of \(T_{R_1,R_2}\) changes by at most 4/n. Hence, the bounded difference inequality enforces that, for any \(t>0\),
The conditional expectation is upper bounded by
Here, unfortunately, we cannot directly replace \({\mathbb {E}}\big [|\widehat{\lambda }_a-\lambda _a||\widehat{\lambda }_b-\lambda _b|\big |\widehat{\lambda }_{\{R\}}\big ]\) by an upper bound of it because this expression does not factorize. We shall prove that \({\mathbb {E}}\big [|\widehat{\lambda }_a-\lambda _a||\widehat{\lambda }_b-\lambda _b|\big |\widehat{\lambda }_{\{R\}}\big ] \) is, up to a small loss, close to a product of expectations.
Write \(N:= n- n\sum _{c\in R} \widehat{\lambda }_c\), \(\lambda _{R}:= \sum _{c\in R} \lambda _c\) and \(\widehat{ \lambda }_{R}= \sum _{c\in R} {\widehat{\lambda }}_c\). Note that \(n\widehat{\lambda }_R\) has a binomial distribution with parameters (n, \(\lambda _R\)). Applying Bernstein’s inequality to \(|\widehat{\lambda }_R-\lambda _R|\) we get
Let \(\mathcal {R}= \left\{ |\widehat{\lambda }_R-\lambda _R|\le \frac{1}{\sqrt{n\log (k)}}\right\} \). Taking \(t=\sqrt{n/\log (k)}\) in (74) we have that
In what follows we assume that the event \(\mathcal {R}\) is true. Take any two distinct elements a and b of \([k]{\setminus } R\). We shall prove that the conditional expectations \({\mathbb {E}}\left[ \left| \widehat{\lambda }_a-\lambda _a\right| \left| \widehat{\lambda }_b-\lambda _b\right| \Big \vert \widehat{\lambda }_{\{R\}}\right] \) are close to the products \({\mathbb {E}}\left[ \left| \widehat{\lambda }_a-\lambda _a\right| \Big \vert \widehat{\lambda }_{\{R\}}\right] {\mathbb {E}}\left[ \left| \widehat{\lambda }_b-\lambda _b\right| \Big \vert \widehat{\lambda }_{\{R\}}\right] \). It is easy to see that conditionally on \((\widehat{\lambda }_{\{R\}},\widehat{\lambda }_a)\), \(n\widehat{\lambda }_b\) follows the Binomial distribution with parameters \(( (N-n\widehat{\lambda }_a),\lambda _b/(1- \lambda _{R} - \lambda _a))\). On the other hand, conditionally on \(\widehat{\lambda }_{\{R\}}\), \(n\widehat{\lambda }_b\) follows the Binomial distribution with parameters \((N,\lambda _b/(1- \lambda _{R}))\). Let \(z_1,z_2,\ldots , \) be a sequence of independent Bernoulli random variables with parameters \(\lambda _b/(1- \lambda _a - \lambda _{R})\), \(w_1,w_2\ldots , \) be an independent sequence of Bernoulli random variables with parameters \((1- \lambda _a -\lambda _{R})/(1- \lambda _{R})\) and \(v_1,v_2,\ldots , \) be an independent sequence of Bernoulli random variables with parameters \(\lambda _b/(1- \lambda _{R})\). We define the following random variables:
where we use \(\lambda _c\le 2/k\) and \(|R|\le 2\sqrt{k}\). It is easy to see that X follows the Binomial distribution with parameters \((N-n\widehat{\lambda }_a)\) and \(\lambda _b/(1- \lambda _{R} - \lambda _a)\) and Y follows the Binomial distribution with parameters N and \(\lambda _b/(1- \lambda _{R})\). Hence, we have that
Relying on our coupling between X and Y, we obtain
On the other hand, conditionally on \(\widehat{\lambda }_{\{R\}}\), \(n\widehat{\lambda }_a\) follows the Binomial distribution with parameters \((N,\lambda _a/(1- \lambda _R))\) so that the Cauchy–Schwarz inequality implies
where we use that \(\lambda _a\le 2/k\) and the definition of the event \(\mathcal {R}\). Similarly we compute
Plugging (76–78) into (75) we get
where we use \(\lambda _b,\lambda _a\le 2/k\). For \(a=b\), (77) implies that the above difference is of order \((kn)^{-1}\). Going back to (73), we obtain that
Let \(S^*\) and \(T^*\) be two sets maximizing the above expression. Then, for all \(a\in S^*\) we have that \(\sum _{ b\in T^*}{\mathbb {E}}\big [|\widehat{\lambda }_b-\lambda _b|\big |\widehat{\lambda }_R\big ] (\varvec{Q}_{ab}-\varvec{Q}^{ad,+}_{ab})\) is non-negative. As a consequence, using (77), we have that
as soon as the event \(\mathcal {R}\) holds. The same reasoning together with \(\vert \varvec{Q}_{ab}-\varvec{Q}^{ad,+}_{ab}\vert \le 2 \) leads to
as soon as the event \(\mathcal {R}\) holds. Going back to (72) and integrating the deviation inequality with respect to \(\widehat{\lambda }_{\{R\}}\), we conclude that
where we use \({\mathbb {P}}(\mathcal {R})\ge 1-2e^{-\sqrt{k}/\log (k)}\). From this point the proof is identical to the main proof: we fix \(t= 2\log (k) q + \sqrt{k}/\log (k)\) and take a union bound over all possible \(R_1\) and \(R_2\) to derive that
on an event of probability higher than \(1-3\exp (-\sqrt{k}/\log (k))\). Then, as in the main proof, Lemma 6 together with (55) and (68) enforces that \( \Vert V\Vert ^{+}_{\square }\le C \sqrt{k/(n\log (k))}\) with probability larger than \(1 - (5+2k)\exp (-\sqrt{k}/\log (k))\). By symmetry, we can find an event \(\mathcal {A}\) of probability larger than \(1- (10+4k)\exp (-\sqrt{k}/\log (k))\) such that, on \(\mathcal {A}\),
In order to control \({\mathbb {E}}[\Vert V\Vert _{\square }]\) on the complementary event \(\bar{\mathcal {A}}\) we use the rough bound
which implies
where we use (53). Together with the decomposition (69), (70) and (71), we conclude that
\(\square \)
F Proof of Theorem 2
It is enough to prove separately the following two minimax lower bounds:
The proof of (80) is identical to the proof of (45) in [26], so we just sketch the main idea. Fix some \(0<\epsilon \le 1/4\). We consider \(W_1\) to be the constant graphon with \(W_1(x,y)\equiv 1/2\), and \(W_2\in \mathcal {W}^+[2]\) to be the 2-step graphon with \(W_{2}(x,y)=1/2+\epsilon \) if \((x,y)\in [0,1/2)^2\cup [1/2,1]^2\) and \(W_{2}(x,y)=1/2-\epsilon \) elsewhere. Obviously, we have \(\delta _{\square }[\rho _n W_1,\rho _n W_2]=\rho _n\epsilon \). Then, standard testing arguments [32] ensure that the minimax risk \( \inf _{\widehat{f}}\sup _{W_0\in \mathcal {W}^+[2]}{\mathbb {E}}_{W_0}[\delta _\square (\widehat{f}, \rho _n W_0)] \) is at least of the order \(\rho _n \epsilon \) when \(\epsilon \) is chosen small enough so that the \(\chi ^2\)-distance \(\chi ^2({\mathbb {P}}_{W_2},{\mathbb {P}}_{W_1})\) is smaller than 1/4. According to Lemma 4.9 in [26], this is the case when \(\epsilon \) is small compared to \((\rho _n n)^{-1/2}\), which proves (80).
Henceforth, we only focus on (79). We first consider the case where k is a multiple of 32 satisfying \(k\ge C_0\) and \(k\le C_1n\) for some sufficiently large numerical constants \(C_0\) and \(C_1\). As the collections \(\mathcal {W}^+[k]\) are nested, this will imply (79) for all \(k\in [32 C_0, n]\). Afterwards, it will suffice to show (79) for \(k=2\) to complete the proof. So, we assume that k is a multiple of 32, that k is large enough and that k is small compared to n. Define \(k_1:=k/2\), \(M_k :=\lceil 128\log (k)\rceil \), \(\eta _0:=1/16\) and \(\eta _1 :=7/8\).
As for Proposition 3, we will rely on Fano's method (Lemma 2). Hence, we shall build a collection \((W_u)\) of graphons that are well-spaced in the cut distance and such that the Kullback–Leibler divergence between the associated distributions \({\mathbb {P}}_{W_u}\) remains small enough. All the graphons in this collection will be based on a \(k_1\times M_k\) matrix \(\varvec{B}\) such that (i) the rows of \(\varvec{B}\) are almost orthogonal and (ii) the \(l_1\) distance between (suitably permuted) columns of \(\varvec{B}\) and convex combinations of its columns is bounded from below. This property will turn out to be useful when lower bounding the \(\delta _{\square }\) distance between the corresponding graphons.
Lemma 11
For k large enough, there exists a matrix \(\varvec{B}\in \{-1,1\}^{k_1\times M_k}\) satisfying the following two properties:
-
(i)
For any \(a,b\in [k_1]\) with \(a\ne b\), the inner product \(\langle \varvec{B}_{a,\cdot }, \varvec{B}_{b,\cdot }\rangle \) of two rows satisfies
$$\begin{aligned} |\langle \varvec{B}_{a,\cdot }, \varvec{B}_{b,\cdot }\rangle |\le M_k/4. \end{aligned}$$(81) -
(ii)
For any two subsets X and Y of \([k_1]\) satisfying \(|X|=|Y|=\eta _0 k_1\) and \(X\cap Y=\emptyset \), any labellings \(\pi _1{:}\,[\eta _0 k_1 ]\rightarrow X\) and \(\pi _{2}{:}\,[\eta _0 k_1 ]\rightarrow Y\), any subset Z of \([M_k]\) of size larger than \(\eta _1 M_k\) and any \(Z\times [M_k]\) stochastic matrix \(\omega \), we have
$$\begin{aligned} \sum _{a=1}^{ \eta _0 k_{1}}\sum _{b\in Z}\big |\varvec{B}_{\pi _1(a),b}-\sum _{c\in [M_k]}\omega _{b,c}\varvec{B}_{\pi _2(a),c}\big |\ge C M_k k_{1}, \end{aligned}$$(82) for some universal constant \(C>0\).
Taking \(\varvec{B}\) as in Lemma 11, we define the connection probability matrix \(\varvec{Q}: = (\varvec{J}+ \varvec{B})/2\) where \(\varvec{J}\) is the \(k_1\times M_k\) matrix with all entries equal to 1. Now we define a collection of step graphons based on \(\varvec{Q}\) that will only slightly differ by the weight of each step.
Fix some \(\epsilon < 1/(8k_1)\) and denote by \(\mathcal {C}_0\) the collection of vectors \(u\in \{-\epsilon ,\epsilon \}^{k_1}\) satisfying \(\sum _{a=1}^{k_1} u_a=0\). For any \(u\in \mathcal {C}_0\), define the cumulative distribution \(F_u\) on \(\{0,\ldots , k_1\}\) by the relations \(F_u(0)=0\) and \(F_u(a)= a/(2k_1) + \sum _{b=1}^a u_b\) for \(a\in [k_1]\), and the cumulative distribution G on \(\{0,\ldots , M_k\}\) by \(G(0)= 1/2\) and \(G(b)=1/2 + b/(2M_k)\). Note that \(F_u\) takes values in \([0,1/2]\) and G takes values in \([1/2,1]\). Then, set \(\Pi _{ab}(u)=[F_{u}(a-1),F_{u}(a))\times [G(b-1),G(b))\) and define the graphon \(W_u\in \mathcal {W}^+[k_{1}+M_{k}]\) by
See Fig. 1 for a drawing of \(W_u\). Note that \(W_u\) is a fairly unbalanced \((k_1+M_k)\)-step graphon: \(M_k\) of its steps have a large weight, of order \(1/\log (k)\). Besides, the \(k_1\) smaller steps are slightly unbalanced, as the weight of each class is either \(1/k-\epsilon \) or \(1/k+\epsilon \). The purpose of the \(M_k\) big steps is to make the cut distance between \(W_u\) and \(W_{v}\) as large as possible (see the proof of Lemma 13).
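As an illustration of this construction, the boundaries \(F_u\) and G can be computed explicitly and checked to tile [0, 1]: \(F_u\) covers \([0,1/2]\) (since \(\sum _a u_a=0\)) and G covers \([1/2,1]\). A sketch with illustrative small values of \(k_1\), \(M_k\) and \(\epsilon <1/(8k_1)\) (the proof itself takes k large):

```python
import random

def build_boundaries(k1, Mk, eps, seed=0):
    """Cumulative distributions F_u and G from the construction:
    F_u(a) = a/(2*k1) + sum_{b<=a} u_b with u in {-eps,+eps}^{k1}, sum(u)=0,
    and G(b) = 1/2 + b/(2*Mk)."""
    rng = random.Random(seed)
    u = [eps] * (k1 // 2) + [-eps] * (k1 // 2)  # zero-sum sign pattern
    rng.shuffle(u)
    F = [0.0]
    for a in range(1, k1 + 1):
        F.append(a / (2 * k1) + sum(u[:a]))
    G = [0.5 + b / (2 * Mk) for b in range(Mk + 1)]
    return F, G

F, G = build_boundaries(k1=16, Mk=8, eps=1 / 200)  # eps < 1/(8*k1) = 1/128
assert all(x < y for x, y in zip(F, F[1:]))  # F is strictly increasing
print(abs(F[-1] - 0.5) < 1e-9, G[0] == 0.5, G[-1] == 1.0)
```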
Next, we shall consider a subcollection \(\mathcal {C}\) of \(\mathcal {C}_0\) such that the graphons \(W_u\) with \(u\in \mathcal {C}\) are well spaced. The following combinatorial result is in the spirit of the Varshamov-Gilbert lemma [32, Lemma 2.9]. It is borrowed from [26] (Lemma 4.4). For \(u\in \mathcal {C}_0\), let \(\mathcal {A}_u:=\{a\in [k_1]{:}\,u_a=\epsilon \}\). Notice that, by definition of \(\mathcal {C}_0\), we have \(|\mathcal {A}_u|=k_1/2\) for all \(u\in \mathcal {C}_0\).
Lemma 12
There exists a subset \(\mathcal {C}\) of \(\mathcal {C}_0\) such that \(\log |\mathcal {C}|\ge k_1/16\) and
for any \(u\ne v\in \mathcal {C}\).
Lemmas 11 and 12 are used to obtain the following lower bound on the distance \(\delta _{\square }(W_u,W_v)\) between two distinct graphons with u and v in \(\mathcal {C}\). This lemma is the main ingredient of the proof.
Lemma 13
There exist two positive universal constants \(C_1\) and \(C_2\) such that, if \(k\epsilon \le C_2\), then for any \(u,v\in \mathcal {C}\) with \(u\ne v\), we have
which implies
Note that for any u and v in \(\mathcal {C}\) it is possible to build a measure-preserving transformation \(\tau \) such that \(W_u-W_v^{\tau }\) is null except on a measurable set of Lebesgue measure of order \(k\epsilon \) (see the proof of Theorem 1 in Sect. 1 for such a construction). Hence, the \(l_1\) norm of \(W_u-W_v^{\tau }\) is of order \(k\epsilon \). Lemma 13 states that, by taking the infimum over all \(\tau \) and considering the weaker norm \(\Vert \cdot \Vert _{\square }\), one still has a lower bound of the same order. The \(M^{-1/2}_k\) factor arises as a consequence of Lemma 4. See the proof for more details.
To apply Fano’s method, we need to upper bound the Kullback–Leibler divergence between the distribution corresponding to any two graphon \(W_u\) and \(W_v\) with u and v in \(\mathcal {C}\). Let \({\mathbb {P}}_{W_u}\) denote the distribution of \(\varvec{A}\) sampled according to the sparse W-random graph model (1) with \(W_0=W_u\). Since the matrix \(\varvec{Q}\) is fixed the difficulty in distinguishing between the distributions \({\mathbb {P}}_{W_u}\) and \({\mathbb {P}}_{W_v}\) for \(u\ne v\) comes from the randomness of the design points \(\xi _1,\ldots ,\xi _n\) in the W-random graph model (1) rather than from the randomness of the realization of the adjacency matrix \(\varvec{A}\) conditionally on \(\xi _1,\ldots ,\xi _n\). The following lemma gives an upper bound on the Kullback–Leibler divergences \(\mathcal {KL}({\mathbb {P}}_{W_u},{\mathbb {P}}_{W_v})\):
Lemma 14
For all \(u,v\in \mathcal {C}_0\) we have
Now, choose \(\epsilon \) such that \(\epsilon ^2=\frac{3}{(16)^3 nk_1}\). When k is small compared to n, this choice of \(\epsilon \) satisfies the conditions of Lemma 13. Then it follows from Lemmas 12 and 14 that
In view Fano’s Lemma (Lemma 2), inequalities (85) and (86) imply that
where \(C>0\) is an absolute constant. This completes the proof for k large enough.
Now we turn to the case \(k=2\). We reduce the lower bound to the problem of testing two hypotheses. Consider the matrix \(\varvec{B}=\left( \begin{array}{cc} 1 &amp; 1 \\ 1 &amp; -1 \end{array}\right) \). Given \(u\in \{-\epsilon ,+\epsilon \}\) define \(F_u(0)=0\), \(F_u(1)=1/2+u\) and \(F_u(2)=1\). Then, we set \(\Pi _{ab}(u)=[F_u(a-1),F_u(a))\times [F_u(b-1),F_u(b))\) for any \(a,b\in \{1,2\}\) and define the graphons
For any measure-preserving bijection \(\tau \), \((W_{\epsilon }-W^{\tau }_{-\epsilon })\) is a four-step graphon. Thanks to Lemma 4, we deduce that \(\delta _{\square }(W_{\epsilon },W_{-\epsilon })\ge C \delta _{1}(W_{\epsilon },W_{-\epsilon })\). Then, it is not hard to see that \(\delta _{1}(W_{\epsilon },W_{-\epsilon })\ge C'\epsilon \), so that \(\delta _{\square }(\rho _n W_{\epsilon },\rho _n W_{-\epsilon })\ge C'\rho _n\epsilon \). Moreover, exactly as in Lemma 14, the Kullback–Leibler divergence between \({\mathbb {P}}_{W_{\epsilon }}\) and \({\mathbb {P}}_{W_{-\epsilon }}\) is bounded by \(Cn\epsilon ^2\). Taking \(\epsilon \) of the order \(n^{-1/2}\), this divergence is small. It is therefore impossible to reliably distinguish \({\mathbb {P}}_{W_{\epsilon }}\) from \({\mathbb {P}}_{W_{-\epsilon }}\) and the estimation error is at least of order \(\rho _n \epsilon \). More formally, we use Theorem 2.2 from [32] to conclude that
where \(C>0\) is an absolute constant.
Proof of Lemma 11
Let \(\varvec{B}\) be a \(k_1\times M_k\) random matrix whose entries are independent Rademacher variables. We shall prove that, with positive probability, \(\varvec{B}\) satisfies both (81) and (82). In particular, this implies the existence of \(\varvec{B}\) satisfying both (81) and (82).
Fix \(a\ne b\). Then, \(\langle \varvec{B}_{a,\cdot }, \varvec{B}_{b,\cdot }\rangle \) is distributed as a sum of \(M_k\) independent Rademacher variables. Using Hoeffding's inequality, we have that
By the union bound, property (81) is satisfied for all \(a\ne b\) with probability greater than \(1- k_1^2\exp [-M_k/32]\). Since \(M_k\ge 128\log (k)\), for k greater than some absolute constant, this probability is greater than 3/4.
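The Hoeffding step can be checked numerically: for a sum S of M independent Rademacher variables, \({\mathbb P}[|S|>M/4]\le 2\exp (-M/32)\). A simulation sketch (the value \(M=128\) is illustrative only):

```python
import math
import random

def rademacher_tail(M, trials=10000, seed=1):
    """Monte-Carlo estimate of P[|sum of M Rademacher variables| > M/4]."""
    rng = random.Random(seed)
    exceed = 0
    for _ in range(trials):
        s = sum(1 if rng.random() < 0.5 else -1 for _ in range(M))
        if abs(s) > M / 4:
            exceed += 1
    return exceed / trials

M = 128  # illustrative; in the proof M_k = ceil(128 * log(k))
# Hoeffding: P[|S| > t] <= 2 exp(-t^2 / (2M)), evaluated at t = M/4
hoeffding_bound = 2 * math.exp(-M / 32)
print(rademacher_tail(M) <= hoeffding_bound)
```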
Turning to (82), we first fix X, Y, Z, \(\pi _1\), \(\pi _2\), and \(\omega \). Let
We have that, conditionally on \((\varvec{B}_{b,c})_{b\in Y, c\in [M_k]}\), \(T_{X,Y,Z,\pi _1,\pi _2,\omega }\) stochastically dominates a binomial distribution with parameters \((\eta _0 k_{1})\cdot |Z|\) and 1/2. Then, Hoeffding's inequality yields
Given any subset Z of \([M_k]\) with \(|Z|\in [\eta _1 M_k, M_k]\), define \(\Omega _Z\) as the collection of \(Z\times [M_k]\) stochastic matrices taking values in the discrete set \(\{0,1/(8M_k),2/(8M_k),\ldots , 1\}\). Since \(X,Y\subset [k_1]\) and \(Z\subset [M_k]\), it is easy to see that the cardinality of the set of all possible tuples \((X,Y,Z,\pi _1,\pi _2,\omega )\) with \(\omega \in \Omega _Z\) is bounded by
Now, taking the union bound, we derive that, simultaneously for all such parameters,
with probability greater than \(1- 2^{2k_1+M_k+1} (\eta _0 k_1)!^2 (8M_k+1)^{M_k^2}\exp [-\eta _0\eta _1 k_1M_k/8]\). Using Stirling's approximation and \(\eta _1 M_k\ge 64 \log (k)\) we get that this probability is larger than 3/4 for k large enough.
Finally, let us consider the general case, where the matrix \(\omega \) does not necessarily belong to \(\Omega _Z\). Observe that in this case, there exists a matrix \(\omega '\in \Omega _Z\) such that \(\max _{b\in Z}\sum _{c\in [M_k]}|\omega _{b,c}-\omega '_{b,c}|\le 1/8\). This implies that
We have proved that (82) holds with probability larger than 3/4. As a consequence, \(\varvec{B}\) satisfies both (81) and (82) with probability larger than 1/2. \(\square \)
Proof of Lemma 13
We fix u and v, two distinct vectors in \(\mathcal {C}\), and a measure-preserving bijection \(\tau {:}\,[0,1]\rightarrow [0,1]\). We shall prove that, for \(k\epsilon \) small enough,
Since \(\delta _{\square }\big (W_u,W_v\big )= \inf _{\tau }\Vert W_u(\cdot ,\cdot ) - W_v(\tau \cdot ,\tau \cdot )\Vert _{\square }\), both (84) and (85) straightforwardly follow from (87). We denote
Since \(\tau \) is measure-preserving, we have
Now, we consider three cases: (i) \(\lambda (\mathcal {B}_{12})\le k_1\epsilon /64\), (ii) \(k_1\epsilon /64<\lambda (\mathcal {B}_{12})\le 1/2 - k_1\epsilon /64\) and (iii) \(\lambda (\mathcal {B}_{12})> 1/2 - k_1\epsilon /64\). In case (i) we shall focus on the restrictions of \(W_u\) and \(W_v^{\tau }\) to \(\mathcal {B}_{11}\times \mathcal {B}_{22}\), so that these restrictions are \(k_1\times M_k\)-step functions. In case (ii), we focus on the restrictions to \(\mathcal {B}_{21}\times \mathcal {B}_{22}\), so that \(W_{v}^{\tau }\) is constant on this restriction. In the pathological case (iii), we introduce a subset such that the restriction of \(W_u\) is an \(M_k\times k_1\)-step function and the restriction of \(W_v^{\tau }\) is a \(k_1\times M_k\)-step function.
Case (i) We focus our attention on coordinates (x, y) in \(\mathcal {B}_{11}\times \mathcal {B}_{22}\). Recall that the cumulative distribution function G is defined by \(G(0)=1/2\) and \(G(b)=1/2+b/(2M_k)\) for \(b\in [M_k]\). For any \((r,s)\in [M_k]^2\), define
In other words, \(\omega _{r,s}\) stands for the weight of indices corresponding to class r in \(W_u\) and class s in \(W_{v}^{\tau }\). By definition of \(\omega _{r,s}\), for any \(r\in [M_k]\), we have
Let \(\mathcal {R}\) denote the set of \(r\in [M_k]\) such that \([G(r-1),G(r))\) has a large intersection with \(\tau ^{-1}([1/2,1])\):
Denote by \(\bar{\mathcal {R}}\) the complement of \(\mathcal {R}\). We have \(\lambda (\mathcal {B}_{22})=1/2 - \lambda (\mathcal {B}_{12})\ge 1/2 - k_1\epsilon /64\ge \tfrac{27}{56}\) for \(k_1\epsilon \) small enough. Hence, it follows that
which implies that \(|\mathcal {R}|\ge 3 M_k/4\) and \(\lambda (\mathcal {Y})=\sum _{r\in \mathcal {R}} \omega _{r\bullet }\ge 9/28\).
Now, denoting \(\mathcal {X}:=\mathcal {B}_{11}\), we define a new kernel \(\overline{W}_v^{\tau }{:}\,\mathcal {X}\times \mathcal {Y}\rightarrow [0,1]\) by
We can view \(\overline{W}_v^{\tau }\) as a smoothed version of the restriction of \(W_v^{\tau }\) to \(\mathcal {X}\times \mathcal {Y}\). The marginal functions \(\overline{W}_v^{\tau }(x,\cdot )\) are step functions with at most \(|\mathcal {R}|\le M_k\) steps of the form \([G(r-1),G(r))\cap \mathcal {B}_{22}\). Moreover, on each interval \([G(r-1),G(r))\cap \mathcal {B}_{22}\), \(\overline{W}_v^{\tau }(x,y)\) equals the mean of \(W_v^{\tau }(x,z)\) for z ranging over this set. Equipped with this notation, we can control the cut distance between \(W_u\) and \(W_v^{\tau }\) in terms of the \(l_1\) distance between the restriction of \(W_u\) to \(\mathcal {X}\times \mathcal {Y}\) and \(\overline{W}_{v}^{\tau }\). For ease of notation, we still write \(W_u\) for this restriction when there is no ambiguity.
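In formulas, the smoothing described above amounts to conditional averaging over each level set: under the definition (92), for \(x\in \mathcal {X}\) and \(y\in [G(r-1),G(r))\cap \mathcal {B}_{22}\) with \(r\in \mathcal {R}\),
$$\begin{aligned} \overline{W}_v^{\tau }(x,y)=\frac{1}{\lambda \big ([G(r-1),G(r))\cap \mathcal {B}_{22}\big )}\int _{[G(r-1),G(r))\cap \mathcal {B}_{22}}W_v^{\tau }(x,z)\,dz. \end{aligned}$$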
The following lemma provides a lower bound on the cut norm \(\Vert W_u-W_v^{\tau }\Vert _{\square }\) in terms of the \(l_1\) norm \(\Vert W_u-\overline{W}_{v}^{\tau }\Vert _{1}\).
Lemma 15
For any u, v in \(\mathcal {C}\) and any measure-preserving transformation \(\tau \), we have
where \(\overline{W}_{v}^{\tau }\) is defined in (92).
In view of Lemma 15, it is enough to control the \(l_1\) norm \(\Vert W_u-\overline{W}_{v}^{\tau }\Vert _{1}\). We do this similarly to the proof of Lemma 4.5 in [26]. For \(a\ne b\) and any \(x\in \left[ F_u(a-1),F_u(a)\right) \cap \mathcal {X}\) and \(x'\in \left[ F_u(b-1),F_u(b)\right) \cap \mathcal {X}\), the inner product between \(W_u(x,\cdot )\) and \( W_v(x',\cdot )\) satisfies
where we used (81) in the last line. For any \(a,b\in [k_1]\), let \(\psi _{ab}\) denote the Lebesgue measure of the set
Since \(\tau \) is measure preserving, it follows that \(\sum _b \psi _{ab}\le 1/(2k_1)+u_a\) and \(\sum _a \psi _{ab}\le 1/(2k_1)+v_b\). For any \(y\in \mathcal {Y}\), we set
Equipped with this notation, we have
Now take any \(a_1\ne a_2\). By (94), \(|h_{u,a}(y)|=1/2\) and using the triangle inequality, we derive that
where we used \(\lambda (\mathcal {Y})\ge 9/28\) in the last line. As a consequence, for any \( b\in [k_1]\) there exists at most one \(a\in [k_1]\) such that \(\Vert h_{u,a}- k_{v,b}\Vert _1< 1/224\). If such an index a exists, we denote it by \(\pi (b)\); \(\pi \) can then be extended to a function from \([k_1]\) to \([k_1]\). Since \(\sum _{a,b}\psi _{a,b}=\lambda (\mathcal {X})\), we get
since \(\lambda (\mathcal {B}_{12})\le k_1\epsilon / 64\). If the sum \(\sum _{b=1}^{k_1} \big (1/(2k_1)+ v_b -\psi _{\pi (b),b}\big )\) is greater than \(k_1\epsilon /32\), then (87) is satisfied. Thus, we can assume in the sequel that \(\sum _{b=1}^{k_1}\big (1/(2k_1)+ v_b -\psi _{\pi (b),b}\big )\le k_1\epsilon /32\).
Using that \(\psi _{a,b}\le (1/(2k_1)+u_a)\wedge (1/(2k_1)+v_b)\) and that the cardinality of the collection \(\{b\in [k_1]:\, v_b>0\}\) is \(k_1/2\) we deduce that the collection \(\{b\in [k_1]:\, v_b>0,\ u_{\pi (b)}>0\text { and } \psi _{\pi (b),b}\ge 1/(2k_1)\}\) has cardinality greater than \(7k_1/16\). Now, Lemma 12 implies that \(|\mathcal {A}_u\cap \mathcal {A}_v|\le 3k_1/8\) for \(u\ne v\in \mathcal {C}\). Then, there exist subsets \(A\subset \mathcal {A}_{u}\) and \(B\subset \mathcal {A}_{v}\) of cardinality \(\eta _0k_1\) (recall that \(\eta _0=1/16)\) such that \(\pi (B)=A\), \(A\cap B=\emptyset \), and \(\psi _{\pi (b),b}\ge 1/(2k_1)\) for all \(b\in B\). The condition \(\psi _{\pi (b),b}\ge 1/(2k_1)\) implies that \(\pi \) is injective on B. Hence,
where the second inequality follows from \(\psi _{\pi (b),b}\ge 1/(2k_1)\) and the fact that \(h_{u,\pi (b)}\) and \(k_{v,b}\) are step functions with steps larger than \(3/(7 M_k)\) [see (90), the definition of \(\mathcal {R}\) and \(\mathcal {Y}\)]. Finally, we apply the property (82) of \(\varvec{B}\) to conclude that
which, together with Lemma 15, proves (87).
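The uniqueness of the index \(\pi (b)\) used in the argument above is a consequence of the triangle inequality: if two distinct indices \(a_1\ne a_2\) both satisfied \(\Vert h_{u,a_i}- k_{v,b}\Vert _1< 1/224\), then
$$\begin{aligned} \Vert h_{u,a_1}-h_{u,a_2}\Vert _1\le \Vert h_{u,a_1}- k_{v,b}\Vert _1+\Vert h_{u,a_2}- k_{v,b}\Vert _1<\frac{1}{112}, \end{aligned}$$
contradicting the separation between \(h_{u,a_1}\) and \(h_{u,a_2}\) derived above; the same dichotomy, with the explicit constants 1/112 and 1/224, appears again in Case (iii).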
Case (ii). Now we assume that \(k_1\epsilon /64<\lambda (\mathcal {B}_{12})<1/2 - k_1\epsilon /64\). Take \(\mathcal {X}= \mathcal {B}_{21}\) and \(\mathcal {Y}= \mathcal {B}_{22}\). On \(\mathcal {X}\times \mathcal {Y}\), \(W_v^{\tau }\) is constant and equal to 1/2. Denote by U the restriction of \(W_u-1/2\) to \(\mathcal {X}\times \mathcal {Y}\). Then, it follows that \(\Vert W_u-W_v^{\tau }\Vert _{\square }\ge \Vert U\Vert _{\square }\). The kernel U is an at most \(k_1\times M_k\)-step function. By Lemma 4, we obtain
where the last equality follows from (89). Using \(\lambda (\mathcal {X})=\lambda (\mathcal {B}_{12})\) and \(x(1/2-x)\ge 1/4\min \left( x,(1/2-x)\right) \) we obtain (87).
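The elementary inequality used in Case (ii) can be checked by splitting into two ranges: for \(0\le x\le 1/2\),
$$\begin{aligned} x\Big (\frac{1}{2}-x\Big )\ge \frac{1}{4}\min \Big (x,\frac{1}{2}-x\Big ), \end{aligned}$$
since \(x\le 1/4\) implies \(1/2-x\ge 1/4\), while \(x>1/4\) implies \(x>1/4\).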
Case (iii) Now we assume that \(\lambda (\mathcal {B}_{12})\ge 1/2 - k_1\epsilon /64\) and take \(\mathcal {X}=\mathcal {B}_{21}\) and \(\mathcal {Y}= \mathcal {B}_{12}\), so that \(\lambda (\mathcal {X})=\lambda (\mathcal {B}_{12})\ge 1/2-k_1\epsilon /64\). Define the smoothed kernel \(\overline{W}_v^{\tau }{:}\,\mathcal {X}\times \mathcal {Y}\rightarrow [0,1]\) by
As a consequence, \(\overline{W}_v^{\tau }\) is \(M_k\times M_k\) block-constant on subsets of the form \( \big (\tau ^{-1}[G(a-1),G(a))\cap \mathcal {X}\big ) \times \big ([G(b-1),G(b))\cap \mathcal {Y}\big )\). Arguing as in the proof of Lemma 15, we derive that
For any a such that \([F_u(a-1),F_u(a))\cap \mathcal {X}\ne \emptyset \), define the function \(h_{u,a}\) on \(\mathcal {Y}\) by \(h_{u,a}(y):= W_u(F_u(a-1),y)-1/2\). Arguing as in Case (i), we observe that \(\Vert h_{u,a_1}-h_{u,a_2}\Vert _1 \ge 1/112\) for any \(a_1\ne a_2\). The kernel \(\overline{W}_v^{\tau }\) is an \(M_k\times M_k\) step function. Hence, there exist a partition \((\mathcal {X}_b)_{b=1,\ldots , M_k}\) of \(\mathcal {X}\) and \(M_k\) functions \(k_{b}(y)\) such that \(\left( \overline{W}_v^{\tau }-1/2\right) (x,y)= \sum _{b=1}^{M_k}\mathbb {1}_{x\in \mathcal {X}_b} k_{b}(y)\). Then, the triangle inequality ensures that, for any \(a_1\ne a_2\) and any \(b\in [M_k]\), we have \(\Vert h_{u,a_1}-k_{b}\Vert _1 + \Vert h_{u,a_2}-k_{b}\Vert _1\ge \Vert h_{u,a_1}-h_{u,a_2}\Vert _1\ge 1/112\). As a consequence, for any \(b\in [M_k]\) there exists at most one a, which we denote by \(\pi (b)\), such that \(\Vert h_{u,\pi (b)}-k_{b}\Vert _1\le 1/224\). Now we compute
where we used \(\lambda (\mathcal {X})\ge 1/4\), \(M_k/k\le 1/8\), and that \(M_k\epsilon \le k\epsilon \) is small enough. Together with (95), we obtain the desired result (87). \(\square \)
Proof of Lemma 15
We first prove that \(\Vert W_u-\overline{W}_{v}^{\tau }\Vert _{\square }\le \Vert W_u-W_{v}^{\tau }\Vert _{\square }\). Fix any measurable subset \(S\subset \mathcal {X}\). Since the functions \(\left[ W_{u}-\overline{W}_{v}^{\tau }\right] (x,\cdot )\) are constant on each set \([G(r-1),G(r))\cap \mathcal {Y}\), the supremum \( \sup _{T\subset \mathcal {Y}}\left| \int _{S\times T}W_u(x,y)-\overline{W}_{v}^{\tau }(x,y)dx dy\right| \) is achieved by a subset T which is a union of some of the sets \([G(r-1),G(r))\cap \mathcal {Y}\), that is, \(T=\cup _{r\in \mathcal {R}'\subset \mathcal {R}}[G(r-1),G(r))\cap \mathcal {Y}\). For such T, the definition (92) of \(\overline{W}^{\tau }_{v}\) implies \( \int _{S\times T}\overline{W}_{v}^{\tau }(x,y)dx dy=\int _{S\times T}W_{v}^{\tau }(x,y)dx dy\), so that
Taking the supremum over all S leads to \(\Vert W_u-\overline{W}_{v}^{\tau }\Vert _{\square }\le \Vert W_u-W_{v}^{\tau }\Vert _{\square }\). By definition of \(W_u\) and \(\overline{W}_{v}^{\tau }\), the difference \(W_u-\overline{W}_{v}^{\tau }\) is a \( k_1^2\times M_k\) step function. Then, Lemma 4 allows us to conclude
\(\square \)
Proof of Lemma 14
The proof of Lemma 14 follows the lines of the proof of Lemma 4.3 in [26]; we give it here for completeness. For \(u\in \mathcal {C}_0\), let \(\zeta (u)=(\zeta _1(u),\ldots ,\zeta _n(u))\) be a vector of n i.i.d. random variables with the discrete distribution on \([k_{1}+M_{k}]\) defined by
Let \(\varvec{\Theta }_0\) be the \(n\times n\) symmetric matrix with elements \((\varvec{\Theta }_0)_{ii}=0\) and \((\varvec{\Theta }_0)_{ij}=\rho _n \varvec{Q}_{\zeta _i(u),\zeta _j(u)}\) for \(i\ne j\). Assume that, conditionally on \(\zeta (u)\), the adjacency matrix \(\varvec{A}\) is sampled according to the network sequence model with this probability matrix \(\varvec{\Theta }_0\). Notice that in this case the observations \(\varvec{A}'=(\varvec{A}_{ij},\ 1\le j<i\le n)\) have the probability distribution \({\mathbb {P}}_{W_u}\). Using this remark and introducing the probabilities \(\alpha _{\varvec{a}} (u) = {\mathbb {P}}[\zeta (u)=\varvec{a}] \) and \(p_{A\varvec{a}}={\mathbb {P}}[\varvec{A}'=A\vert \zeta (u)=\varvec{a}]\) for \(\varvec{a}\in [k_1+M_k]^n\), we can write the Kullback–Leibler divergence between \({\mathbb {P}}_{W_u}\) and \({\mathbb {P}}_{W_v}\) in the form
where the sums in \(\varvec{a}\) are over \([k_1+M_k]^n\) and the sum in A is over all upper triangular halves of matrices in \( \{0,1\}^{n\times n }\). Since the function \((x,y) \mapsto x\log (x/y)\) is jointly convex, we can apply Jensen’s inequality to get
where the last equality follows from the fact that \(\alpha _{\varvec{a}} (u)\) are n-product probabilities. Using (96) we get
which is equal to n/2 times the Kullback–Leibler divergence between two discrete distributions. Since the Kullback–Leibler divergence is less than the chi-square divergence, we obtain
where in the last inequality we used \(|v_a|\le \epsilon \le 1/(8k_1)\) and \(|u_a-v_a|\le 2\epsilon \). Combining this with (97) proves the lemma. \(\square \)
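The comparison of divergences invoked in the proof above is the standard bound \(\mathrm {KL}\le \chi ^2\), which follows from \(\log x\le x-1\): for discrete distributions \(P=(p_i)\) and \(Q=(q_i)\),
$$\begin{aligned} \mathrm {KL}(P\Vert Q)=\sum _i p_i\log \frac{p_i}{q_i}\le \sum _i p_i\Big (\frac{p_i}{q_i}-1\Big )=\sum _i\frac{(p_i-q_i)^2}{q_i}=\chi ^2(P\Vert Q). \end{aligned}$$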
G Proof of Proposition 7
To prove (24), it is enough to prove separately the following three minimax lower bounds:
The proof of (98) follows from the proof of (43) in [26] using the trivial inequality
The proof of (99) follows the lines of the proof of (44) using that \(\Vert \mathbf {B}\Vert ^{2}_{2}=\Vert \mathbf {B}\Vert _{1}\) for matrices with entries in \(\{-1,1\}\). The proof of (100) is identical to the proof of (45) in [26].
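The identity for matrices with \(\pm 1\) entries invoked above is immediate entrywise: since \(\mathbf {B}_{ij}^2=|\mathbf {B}_{ij}|=1\) when \(\mathbf {B}_{ij}\in \{-1,1\}\),
$$\begin{aligned} \Vert \mathbf {B}\Vert _2^2=\sum _{i,j}\mathbf {B}_{ij}^2=\sum _{i,j}|\mathbf {B}_{ij}|=\Vert \mathbf {B}\Vert _1. \end{aligned}$$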
In order to prove the upper bound (25), the proof of Proposition 3.2 in [26] can be easily modified to get an upper bound on the agnostic error measured in \(l_1\)-distance:
Lemma 16
(Agnostic error measured in \(l_1\)-distance) Consider the W-random graph model. For any integer \(k\le n\), \(W_0\in \mathcal {W}^+[k]\) and \(\rho _n>0\), we have
Now (25) follows from Lemma 16 and (16). Finally, the \(\rho _n\) convergence rate is simply achieved by the constant estimator \(\widehat{f}\equiv 0\).
H Proof of Proposition 8
For \(\varvec{\Theta }_0\) generated according to the sparse W-random graph model (26) with graphon \(W_0\in \mathcal {W}^+_{1}\), integrating (9) with respect to \({\varvec{\xi }}\) and using \(\Vert W_0\Vert _{1}=1\), we get
So, using the triangle inequality (20) it is enough to bound the agnostic error \({\mathbb {E}}_{W_0}[\delta _{\square }(\widetilde{f}_{\varvec{\Theta }_0}, f'_0)]\). We take \(W^{*}\in \mathcal {W}^+_1[k,\mu ]\) (or \(W^{*}\in \mathcal {W}^+_2[k]\) in the case of \(L_2\) graphons) such that
or \(\delta _{2}\left( W^{*}, W'_0\right) \le \inf _{W\in \mathcal {W}^+_2[k]}\delta _{2}\left( W, W'_0\right) (1+1/n^{2})\) for \(L_2\) graphons. Without loss of generality, we can assume that \(\rho _nW^{*}(x,y)\le 1\). Let \(f^{*}=\rho _nW^{*}\) and let \(\varvec{\Theta }^{*}=(\varvec{\Theta }^{*}_{ij})\) be such that, for \(i\ne j\), \(\varvec{\Theta }^{*}_{ij}= \rho _n W^*(\xi _i,\xi _j)\), where \((\xi _i)\) are the same as for \(\varvec{\Theta }_0\). The triangle inequality implies
where we used \(\delta _{\square }( f'_0,f^{*} )\le \delta _{1}( f'_0,f^{*} )\), \({\mathbb {E}}_{W_0}[\delta _{\square }(\widetilde{f}_{\varvec{\Theta }^{*}}, \widetilde{f}_{\varvec{\Theta }_0})]\le \delta _{1}( f'_0,f^{*} )\), and the fact that \( \widetilde{f}_{\varvec{\Theta }^{*}}\) has the distribution of the empirical graphon under \(W^*\). Similarly, for \(L_2\) graphons, we obtain \({\mathbb {E}}_{W_0}[\delta _{\square }(\widetilde{f}_{\varvec{\Theta }_0}, f'_0)]\le 2\delta _{2}(f'_0,f^{*} )+{\mathbb {E}}_{W^* }[\delta _{\square }(f^{*}, \widetilde{f}_{\varvec{\Theta }^{*}} )]\). Then, we use the following lemma:
Lemma 17
(i) Consider any \(W^* \in \mathcal {W}^+_1[k,\mu ]\) and \(\rho _n\ge 1/n\) such that \(\rho _nW^* (x,y)\le 1\). Then
$$\begin{aligned} {\mathbb {E}}_{W^*}\left[ \delta _{\square }\left( \widetilde{f}_{\varvec{\Theta }^{*}}, f^{*}\right) \right] \le C \left[ \rho _n\Vert W^{*}\Vert _{1} \sqrt{\frac{k}{ \mu n}}+\sqrt{\dfrac{\rho _n}{n}}\right] . \end{aligned}$$
(ii) Consider any \(W^{*}\in \mathcal {W}^+_2[k]\) and \(\rho _n\ge 1/n\) such that \(\rho _nW^{*}(x,y)\le 1\). Then,
$$\begin{aligned} {\mathbb {E}}_{W^*}\left[ \delta _{\square }\left( \widetilde{f}_{\varvec{\Theta }^{*}}, f^{*}\right) \right] \le C \left[ \rho _n\Vert W^{*}\Vert _{2} \sqrt{\frac{k}{n}}+\sqrt{\dfrac{\rho _n}{n}}\right] . \end{aligned}$$(103)
Now (27) follows from part (i) of Lemma 17 and \(\Vert W^{*}\Vert _{1}\le \Vert W_0\Vert _1(2+n^{-2})\). The proof of (29) follows the same lines using part (ii) of Lemma 17.
To prove (28) and (30) we only need to prove that \({\mathbb {E}}_{W_0}\left[ \Vert {\widetilde{\varvec{\Theta }}}_{\lambda }-\varvec{\Theta }_0\Vert _\square \right] \le C\sqrt{\rho _n/n}\). Using the definition of \({\widetilde{\varvec{\Theta }}}_{\lambda }\) (13) we compute
where we used that \(\Vert \varvec{B}\Vert _{\square }\le \Vert \varvec{B}\Vert _{2\rightarrow 2}/n\) and the definition of \(\widetilde{\varvec{\Theta }}_{\lambda }\). This completes the proof of Proposition 8.
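The norm comparison \(\Vert \varvec{B}\Vert _{\square }\le \Vert \varvec{B}\Vert _{2\rightarrow 2}/n\) used above can be sketched as follows (with the cut norm of an \(n\times n\) matrix normalized by \(n^2\), as elsewhere in the paper): for any \(S,T\subset [n]\),
$$\begin{aligned} \big |\mathbb {1}_S^{\top }\varvec{B}\,\mathbb {1}_T\big |\le \Vert \varvec{B}\Vert _{2\rightarrow 2}\,\Vert \mathbb {1}_S\Vert _2\Vert \mathbb {1}_T\Vert _2\le n\,\Vert \varvec{B}\Vert _{2\rightarrow 2},\qquad \text {so that}\qquad \Vert \varvec{B}\Vert _{\square }=\frac{1}{n^2}\max _{S,T\subset [n]}\big |\mathbb {1}_S^{\top }\varvec{B}\,\mathbb {1}_T\big |\le \frac{\Vert \varvec{B}\Vert _{2\rightarrow 2}}{n}. \end{aligned}$$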
Proof of Lemma 17
Consider the matrix \({\varvec{\Theta }}'\) with entries \(({\varvec{\Theta }}')_{ij}=\rho _nW^{*}(\xi _i,\xi _j)\) for all i, j. As opposed to \(\varvec{\Theta }^{*}\), the diagonal entries of \({\varvec{\Theta }}'\) are not constrained to be null. By the triangle inequality, we get
Since the entries of \(\varvec{\Theta }^{*}\) coincide with those of \({\varvec{\Theta }}'\) outside the diagonal, the difference \(\widetilde{f}_{\varvec{\Theta }^{*}}- \widetilde{f}_{{\varvec{\Theta }}'}\) is null outside a set of measure 1/n. Also, the entries of \({\varvec{\Theta }}'\) are smaller than 1. It follows that \( {\mathbb {E}}[\delta _{\square }(\widetilde{f}_{\varvec{\Theta }^{*}}, \widetilde{f}_{{\varvec{\Theta }}'})]\le 1/n\le \sqrt{\rho _n/n}\). Since \(\delta _{\square }(\widetilde{f}_{{\varvec{\Theta }}'}, f^{*})\le \delta _1(\widetilde{f}_{{\varvec{\Theta }}'}, f^{*})\), it suffices to prove that
Since \(W^*\) is a k-step function, we can reorganize \(f^*\) and \(\widetilde{f}_{\varvec{\Theta }'}\) in such a way that these two graphons are equal on a set of large Lebesgue measure. More precisely, we adopt the same approach as in the proof of Theorem 1 and only sketch the argument here. Let \(\varvec{Q}\in (\mathbb {R}^+)^{k\times k}_{sym}\) and \(\phi {:}\,[0,1]\rightarrow [k]\) be the matrix and map that characterize \(W^*\). For \(a=1,\ldots , k\), denote \(\lambda _a=\lambda (\phi ^{-1}(\{a\}))\). For any \(b\in [k]\), define the cumulative distribution function \(F_{\phi }(b)=\sum _{a=1}^{b} \lambda _a\) and set \(F_{\phi }(0)=0\). For any \((a,b)\in [k]\times [k]\) define \(\Pi _{ab}(\phi )=[F_{\phi }(a-1),F_{\phi }(a))\times [F_{\phi }(b-1),F_{\phi }(b))\). Define \(W'(x,y)= \sum _{a=1}^k \sum _{b=1}^k \varvec{Q}_{ab} \mathbb {1}_{\Pi _{ab}(\phi )}(x,y)\). Obviously, \(f'=\rho _n W'\) is weakly isomorphic to \(f^{*}=\rho _n W^{*}\). Now, let \(\widehat{\lambda }_a=\frac{1}{n} \sum _{i=1}^n \mathbb {1}_{\{ \xi _i\in \phi ^{-1}(a)\}}\) be the (unobserved) empirical frequency of group a. Consider a function \(\psi {:}\,[0,1]\rightarrow [k]\) such that:
(i) \(\psi (x)= a\) for all \(a\in [k]\) and \(x\in [F_{\phi }(a-1), F_{\phi }(a-1)+ \widehat{\lambda }_a\wedge \lambda _a)\);
(ii) \(\lambda (\psi ^{-1}(a))= \widehat{\lambda }_a\) for all \(a\in [k]\).
Such a function \(\psi \) exists (for details, see Step 2 of the proof of Theorem 1). Finally, define the graphon \(\widehat{f}'(x,y)= \rho _n\varvec{Q}_{\psi (x),\psi (y)}\). Notice that \(\widehat{f}'\) is weakly isomorphic to the empirical graphon \(\widetilde{f}_{{\varvec{\Theta }}'}\). Since \(\delta _1(\cdot ,\cdot )\) is a metric on the quotient space of graphons, we have
The two functions \(f'(x,y)\) and \(\widehat{f}'(x,y)\) are equal except possibly when x or y belongs to one of the intervals \([F_{\phi }(a-1)+ \widehat{\lambda }_a\wedge \lambda _a,\ F_{\phi }(a-1)+\lambda _a)\) for \(a\in [k]\), and we have
Since \(\xi _1,\ldots ,\xi _n\) are i.i.d. uniformly distributed random variables, \(n\widehat{\lambda }_a\) has a binomial distribution with parameters \((n, \lambda _a)\). By the Cauchy–Schwarz inequality, we get \({\mathbb {E}}[|\lambda _a-\widehat{\lambda }_a|]\le \sqrt{\lambda _a(1-\lambda _a)/n}\) and \(\mathbb {E}(\vert \lambda _a-{\widehat{\lambda }}_a\vert \vert \lambda _b-{\widehat{\lambda }}_b\vert )\le \sqrt{\lambda _a\lambda _b}/n\). Then, we get
Now for \(W^{*}\in \mathcal {W}^+_{1}[k,\mu ]\) we use \(\lambda _{a}\ge \mu /k\) for all \(a\in [k]\) to get
since we assume that \(k\le \mu n\). For \(W^{*}\in \mathcal {W}^+_2[k]\) we use the Cauchy–Schwarz inequality:
since \(k\le n\). \(\square \)
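The Cauchy–Schwarz bound \({\mathbb {E}}[|\lambda _a-\widehat{\lambda }_a|]\le \sqrt{\lambda _a(1-\lambda _a)/n}\) used in the proof of Lemma 17 can be checked numerically. The sketch below (an illustration only, not part of the argument) computes the exact mean absolute deviation of the empirical frequency \(\widehat{\lambda }\), where \(n\widehat{\lambda }\sim \mathrm {Bin}(n,\lambda )\), directly from the binomial probability mass function and compares it with the variance bound:

```python
import math

def mean_abs_dev(n, lam):
    """Exact E|lambda_hat - lambda| where n * lambda_hat ~ Binomial(n, lam)."""
    return sum(
        math.comb(n, j) * lam**j * (1 - lam) ** (n - j) * abs(j / n - lam)
        for j in range(n + 1)
    )

# Cauchy-Schwarz: E|lambda_hat - lambda| <= sqrt(Var(lambda_hat))
#               = sqrt(lam * (1 - lam) / n)
for n in (10, 50, 200):
    for lam in (0.1, 0.3, 0.5):
        assert mean_abs_dev(n, lam) <= math.sqrt(lam * (1 - lam) / n)
```

For n = 1 and \(\lambda =1/2\) the bound is attained with equality, since \(|\widehat{\lambda }-\lambda |=1/2\) almost surely; in general the mean absolute deviation is strictly below the standard deviation.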
Klopp, O., Verzelen, N. Optimal graphon estimation in cut distance. Probab. Theory Relat. Fields 174, 1033–1090 (2019). https://doi.org/10.1007/s00440-018-0878-1