Graph sampling

We synthesise the existing theory of graph sampling. We propose a formal definition of sampling in finite graphs, and provide a classification of potential graph parameters. We develop a general approach of Horvitz–Thompson estimation to T-stage snowball sampling, and present various reformulations of some common network sampling methods in the literature in terms of the outlined graph sampling theory.


Introduction
Many technological, social and biological phenomena exhibit a network structure that may be the interest of study; see e.g. Newman [20]. As an example of technological networks, consider the Internet as consisting of routers that are connected to each other via cables. There are two types of objects, namely routers and cables. A router must be connected to a cable to be included in the Internet, and a cable must have two routers at both ends. As another example, consider the social network of kinships. Again, there are two types of objects, namely persons and kinships. Each person must have two or more kinships, and each kinship must represent a connection between two persons. However, while it is obvious that any two routers must be connected by cables to each other either directly or via other routers in the Internet, it is not sure that any two persons can be connected to each other in the network of kinships. The difference can be articulated in terms of some appropriate characterisation of their respective network structures.
Following Frank [11,12,14], we refer to network as a valued graph, and graph as the formal structure of a network. The structure of a network, i.e. a graph, is defined as a collection of nodes and edges (between the nodes); measures may be attached to the nodes or the edges or both to form a valued graph, i.e. a network. For a statistical approach to networks one may choose to model the entire population network as a random realisation [16], or to exploit the variation over possible sample networks taken from a given fixed population network. Graph sampling theory deals with the structure of a network under the latter perspective. In comparison, finite-population sampling [3,21] can mostly be envisaged as sampling in a 'graph' with no edges at all. We shall refer to such a setting as list sampling.
Ove Frank has undoubtedly made the most contributions to the existing graph sampling theory. See e.g. Frank [8,10,[12][13][14] for his own summary. However, the numerous works of Frank scatter over several decades, and are not easily appreciable as a whole. For instance, Frank derives results for different samples of nodes [5,8,15], dyads [5][6][7]10] or triads [5,10]. But he never proposes a formal definition of the "sample graph" which unifies the different samples. Or, Frank studies various characteristics of a graph, such as order [5,8,15], size [5][6][7]10], degree distribution [5,11], connectedness [5,9], etc. But he never provides a structure of possible graph parameters which allows one to classify and contrast the different interests of study. Finally, Frank does not appear to have articulated the role of graph sampling theory in relation to some common "network sampling methods" (e.g. [1,19,24]), which "are not explicitly stated as graph problems but which can be given such formulations" [8].
The aim of this paper is to synthesise and extend the existing graph sampling theory, many elements of which are only implicit in Frank's works. In particular, we propose a definition of sample graph taken from a given population graph, together with the relevant observation procedures that enable sampling in a graph (Sect. 2). In Sect. 3, we provide a structure of graph totals and graph parameters, which reflects the extended scope of investigation that can be difficult or impossible using only a list representation. Next, we develop a general approach to HT-estimation under arbitrary T -stage snowball sampling (Sect. 4). In Sect. 5, we present various graph sampling reformulations of multiplicity sampling [1], indirect sampling [19] and adaptive cluster sampling [24], all of which are referred to as unconventional sampling methods in contrast to the more familiar finite-population sampling methods, such as stratified multi-stage sampling. Finally, some concluding remarks are given in Sect. 6, together with a couple of topics of current research.

Sampling on a graph 2.1 Terms and notations
A graph G = (U, A) consists of a set of nodes U and a set of edges A. Define |U | = N and |A| = R as the order and size of G, respectively. Let A i j ⊂ A be the set of all edges from i to j; let a i j = |A i j | be its size. If a i j > 1 for some i, j ∈ U , the graph is called a multigraph; it is a simple graph if a i j = 0, 1. The edges in A i+ = j∈U A i j and A +i = j∈U A ji are called the outedges and inedges at i, respectively. Let a i+ = |A i+ | = j∈U a i j and a +i = |A +i | = j∈U a ji . The node i is incident to each outedge or inedge at i. The number of edges incident at a node i is called the degree of i, denoted by d i = a i+ + a +i . Two nodes i and j are adjacent if there exists at least one edge between them, i.e. a i j + a ji > 1. For any edge in A i j , i is called its initial node and j its terminal node. Let α i be the successors of i, which are the terminal nodes of outedges at i; let β i be the predecessors of i, which are the initial nodes of inedges at i. For a simple graph, we have a i+ = |α i | and a +i = |β i |. A graph is said to be directed (i.e. a digraph) if A i+ = A +i ; it is undirected if A i+ = A +i , in which case there is no distinction between outedge and inedge, so that d i = a i+ = a +i , and α i = β i . Finally, an edge a ii connecting the same node i is called a loop, which can sometimes be a useful means of representation. Whether or not loops are included in the definitions of the terms and notations above is purely a matter of convention.
Remark Adjacency refers to relationship between nodes, as objects of the same kind; incidence refers to relationship between nodes and edges, i.e. objects of different kinds.
Remark Let the N × N adjacency matrix A have elements a i j = |A i j |. It is defined to be symmetric for undirected graphs. Put the diagonal degree matrix D = diag(A1 N ×1 ). The Laplacian matrix L = D − A sums to 0 by row and column, which is of central interest in Spectral Graph Theory (e.g. [2]).

Definition of sample graph
Denote by s 1 an initial sample of nodes, for s 1 ⊆ U . Under a probability design, let π i and π i j (orπ i andπ i j ) be the probabilities of inclusion (or exclusion) of respectively a node and a pair of nodes in s 1 . (The exclusion probability of i is the probability of i / ∈ s 1 , and the exclusion probability of a pair (i, j) is the probability that neither i nor j is in s 1 .) A defining feature of sampling on graphs is that one makes use of the edges to select the sample graph, denoted by G s . Given s 1 , the relevant edges are either in α(s 1 ) = i∈s 1 α i or β(s 1 ) = i∈s 1 β i , where α(s 1 ) = β(s 1 ) for undirected graphs. An observation procedure of the edges needs to be specified, and the observed edges can be given in terms of a reference set of node pairs, denoted by s 2 where s 2 ⊆ U × U , under the convention that the set of edges A i j are observed whenever (i j) ∈ s 2 . Notice that generally speaking (i j) and ( ji) are considered as two distinct elements in U × U . Denote by π (i j) (orπ (i j) ) the corresponding inclusion (or exclusion) probability of (i j) ∈ s 2 , and by π (i j)(kl) (orπ (i j)(kl) ) the inclusion (or exclusion) probability of these two pairs in s 2 . Denote by A s = A(s 2 ) the edge set inherent of s 2 , and U s = s 1 ∪ Inc(A s ) the union of s 1 and those nodes that are incident to A s . The sample graph is G s = U s , A s .
where α(s 1 ) = {2} in this case, the sample graph G s has A s = A(s 2 ) = A 12 and U s = {1, 2}. The same sample graph can equally be given by Observation procedure Frank [8] considers several observation procedures, which can be formalised as follows. First, given s 1 , a procedure is induced if A i j is observed iff both i ∈ s 1 and j ∈ s 1 , or incident reciprocal if A i j and A ji are both observed provided either i ∈ s 1 or j ∈ s 1 . Second, for digraphs, an incident non-reciprocal procedure is forward if A i j is observed provided i ∈ s 1 , or backward if A i j is observed provided j ∈ s 1 . For example, provided i ∈ s 1 and j / ∈ s 1 and a i j > 0 and a ji > 0, we would observe both A ji and A ji given an incident reciprocal procedure; only A i j if it is incident forward; only A ji if it is incident backward; neither A i j nor A ji given an induced procedure from s 1 .
Initial sampling of edges Sample graph initiated by a sample of edges can be defined analogously. Bernoulli or Poisson sampling can be useful, because it is not required to know all the edges in advance. Notice that when one is interested in the totals or other functions of the edges of a graph, initial Bernoulli or Poisson sampling of edges is meaningful-see e.g. Frank [8,Section 12], whereas initial simple random sampling (of edges) would have been a trivial set-up, because one would need to know all the edges to start with.

Some graph sampling methods
We describe some sampling methods based on the aforementioned observation procedures. Frank [8] elicited several sampling methods based on the aforementioned observation procedures. We include several alternative specifications which are marked by †. By way of introduction, the first-and second-order inclusion probabilities of (i j) in s 2 are given in terms of the relevant inclusion probabilities in s 1 , which facilitates Horvitz-Thompson (HT) estimation of any totals defined on U × U . As will be illustrated, given s 1 and the observation procedure, the sample graph can be specified using different reference sets s 2 , but the inclusion probabilities are more readily obtained for some choices of s 2 .
Remark The sample edge set A(s 2 ) is the same in (ii.2) and (ii.1), because the observation procedure is the same given s 1 . For the estimation of any total over A, the two reference sets would yield the same HT-estimate: any (i j) ∈ s 2 with a i j = 0 does not contribute to the estimate, regardless of its π (i j) ; whereas for any (i j) ∈ s 2 with a i j > 0, we have π (i j) = π i given s 2 in (ii.2), just as one would have obtained in (ii.1) since B j = B j ∪ {i} provided a i j > 0. But it appears easier to arrive at π (i j) and the HT-estimator in (ii.2) than (ii.1).
: This is the smallest Cartesian product that contains the same sample edge set as in (ii.1) and (ii.2).
This is the smallest reference set for the same G s in (ii.1)-(ii.4). (iii) s 2 = s a × s a , s a = α(s 1 ) ∪ s 1 [Induced from s a ]: (i j) ∈ s 2 even if i ∈ s a \s 1 and j ∈ s a \s 1 . Similarly to (ii.1), (i j) ∈ s 2 iff B i ∩ s 1 = ∅ and B j ∩ s 1 = ∅, and so on. Then, Remark Observation of the edges between i ∈ s a \s 1 and j ∈ s a \s 1 may be demanding in practice, even when the observation procedure is reciprocal. For example, let the node be email account. Then, by surveying i ∈ s 1 only, it is possible to observe all the email accounts that have exchanges with i due to reciprocality. But one would have to survey the accounts in α i \s 1 additionally, in order to satisfy the requirement of (iii).
(iv.1) s 2 = s 1 × U ∪ U × s 1 [Incident reciprocal]: (i j) / ∈ s 2 iff i / ∈ s 1 and j / ∈ s 1 . Then, π (i j) = 1 −π i j and π (i j)(kl) = 1 −π i j −π kl +π i jkl . (iv.2) † s 2 = s 1 × s a ∪ s a × s 1 , s a = α(s 1 ) ∪ s 1 [Incident reciprocal]: We have s a × s a = s 2 ∪ (s a \s 1 ) × (s a \s 1 ), where the two sets on the right-hand side are disjoint. The inclusion probabilities can thus be derived from those in (iii) and those of (s a \s 1 ) × (s a \s 1 ). However, the sample edge set A(s 2 ) is the same as in (iv.1), and it is straightforward to derive the HT-estimator of any total over A based on the reference set s 2 specified in (iv.1).
: This is the smallest reference set of the sample edge set in (iv.1)-(iv.3). Figure 1 illustrates the four sampling methods (i)-(iv) described above, all of which are based on the same initial sample s 1 = {3, 6, 10}.

Graph parameter and HT-estimation
Frank [12] reviews some statistical problems based on population graphs. In a list representation, the target population U is a collection of elements, which are associated with certain values of interest. In a graph representation G = (U, A), the elements in U can be the nodes that have relations to each other, which are presented by the edges in A. It becomes feasible to investigate the interactions between the elements, their structural positions, etc. which are difficult or unnatural using a list representation. The extended scope of investigation is above all reflected in the formulation of the target parameter. In this Section, we provide our own classification of the potential target parameters based on a graph in terms of graph totals and graph parameters.
Graph total and graph parameter Let M k be a subset of U , where |M k | = k. Let C k be the set of all possible M k 's, where |C k | = N ![k!(N − k)!] −1 . Let G(M k ) be the subgraph induced by M k . Let y G(M k ) , or simply y(M k ), be a function of integer or real value. The corresponding kth order graph total is given by We refer to functions of graph totals as graph parameters.
Remark Network totals can as well be defined by (1), where y(·) can incorporate the values associated with the nodes and edges of the induced subgraph G(M k ).
Motif A subset M ⊂ U with specific characteristics is said to be a motif, denoted by [M].

First-order graph total: M 1 = {i}
Each M 1 corresponds to a node. In principle any first-order graph total can be dealt with by a list sampling method that does not make use of the edges, against which one can evaluate the efficiency of any graph sampling method. For the two parameters given below, estimation of the order by snowball sampling is considered by Frank [5,8,15], and estimation of the degree distribution is considered by Frank [5,11].
Then, θ is the number of nodes with degree d.

Second-order graph total: M 2 = {i, j}
An M 2 of a pair of nodes is called a dyad, for M 2 ⊂ U and |M 2 | = 2. Some dyad totals are considered by Frank [5,10]. Then, R = θ + θ is a graph parameter based on a 1st-and a 2nd-order graph totals.
Remark Let N d be the no. degree-d nodes, which is a 1st-order graph total. Then, This is an example where a higher-order graph total (R) can be 'reduced' to lower-order graph parameters (N d ). Such reduction can potentially be helpful in practice, e.g. when it is possible to observe the degree of a sample node without identifying its successors.

Number of adjacent pairs
Let y(M 2 ) = δ(a i j + a ji > 0) indicate whether i and j are adjacent. Then, θ is the total number of adjacent pairs in G. Its ratio to |C 2 | provides a graph parameter, i.e. an index of immediacy in the graph. Minimum immediacy is the case when a graph consists of only isolated nodes, and maximum immediacy if the graph is a clique, where every pair of distinct nodes are adjacent with each other.
Number of mutual relationships Let y(M 2 ) = δ(a i j a ji > 0) indicate whether i and j have reciprocal edges between them, in which case their relationship is mutual. Then, θ is the number of mutual relationships in the graph. Goodman [17] studies the estimation of the number of mutual relationships in a special digraph, where a i+ = 1 for all i ∈ U .
Number of triads Let y(M 3 ) = δ(a i j a jh a ih > 0) indicate whether the three nodes form a triangle in an undirected graph. Then, θ * by (1) is the total number of triangles. Triangles on undirected graphs are intrinsically related to equivalence relationships: for a relationship (represented by an edge) to be transitive, every pair of connected nodes must be adjacent; hence, any three connected nodes must form a triangle. For a simple undirected graph, transitivity is the case iff θ = 0, when θ is given by (1), where Provided this is not the case, one can e.g. still measure the extent of transitivity by i.e. a graph parameter. Next, for digraphs and ordered ( jih), let z( jih) = a ji a ih a hj be the count of strongly connected triangles from j via i and h back to j. Let M 3 contain all the possible orderings of Then, the number of strongly connected triangles in a digraph is given by (1), where Remark For undirected simple graphs, Frank [13] shows that there exists an explicit relationship between the mean and variance of the degree distribution and the triads of the graph. Let the numbers of triads of respective size 3, 2 and 1 be given by

Graph totals of unspecified order
A motif is sometimes defined in an order-free manner. Insofar as the corresponding total can be given as a function of graph totals of specific orders, it can be considered a graph parameter. Below are some examples that are related to the connectedness of a graph. The number of connected components is considered by Frank [5,9].

Number of connected components
The subgraph induced from M k is a connected component of order k, provided there exists a path for any i = j ∈ M k and a i j = a ji = 0 for any i ∈ M k and j / ∈ M k , in which case let y(M k ) = 1 but let y(M k ) = 0 otherwise. Then, θ k given by (1) is the number of connected components of order k. The number of connected components (i.e. as a motif of unspecified order) is the graph parameter given by θ = N k=1 θ k . At one end, where A = ∅, i.e. there are no edges at all in the graph, we have θ = N = θ 1 and θ k = 0 for k > 1. At the other end, where there exists a path between any two nodes, we have θ = θ N = 1 and θ k = 0 for k < N .

Number of trees in a forest
As an example where θ can be reduced to a specific graph total, suppose the undirected graph is a forest, where every connected component is a tree. We have then θ = N − R, where R is the size of the graph, which is a 2nd-order parameter.
Number of cliques A clique is a connected component, where there exists an edge between any two nodes of the component. It is a motif of unspecified order. The subgraph induced by a clique is said to be complete. A clustered population can be represented by a graph, where each cluster of population elements (i.e. nodes) form a clique, and two nodes i and j are adjacent iff the two belong to the same cluster.
Index of demographic mobility Given the population of a region (U ), let there be an undirected edge between two persons i and j if their family trees intersect, say, within the last century, i.e. they are relatives of each other within a 'distance' of 100 years. Each connected component in this graph G is a clique. The ratio between the number of connected components θ and N , where N is the maximum possible θ , provides an index of demographic mobility that varies between 1/N and 1. Alternatively, an index can be given by the ratio between the number of edges R and |C 2 |, which varies between 0 and 1, and is a function of a 2nd-order graph total. This is an example where the target parameter can be specified in terms of a lower-order graph total than higher-order totals.
Remark In the context of estimating the number of connected components, Frank [5] discusses the situation where observation is obtained about whether a pair of sample nodes are connected in the graph, without necessarily including the paths between them in the sample graph. The observation feature is embedded in the definition of the graph here.

Geodesics in a graph
Let an undirected graph G be connected, i.e. U = M N is a connected component. The geodesic between nodes i and j is the shortest path between them, denoted by [M k ], where M k contains the nodes on the geodesic, including i and j. A geodesic [M k ] is a motif of order k, whereas geodesic is generally a motif of unspecified order. Let θ be the harmonic mean of the length of the geodesics in G, which is a closeness centrality measure [20]. For instance, it is at its minimum value 1 if G is complete. Alternatively, let y( is the geodesic between i and j, so that θ can equally be given as a 2nd-order graph parameter. Again, this is an example where a lower-order graph parameter can be used as the target parameter instead of alternatives involving higher-order graph totals, provided the required observation.

HT-estimation
A basic estimation approach in graph sampling is the HT-estimator of a graph total (1). Provided the inclusion probability π (M k ) for M k ∈ C k , the HT-estimator is given by where means not only M k ⊆ U s , but also it is possible to identify whether M k is a particular motif in order to compute y(M k ). The probability π (M k ) is defined with respect to a chosen reference set s 2 and the corresponding sample graph G s . It follows that a motif More detailed explanation of π (M k ) will be given in Sect. 4. The example below illustrates the idea.
Example 3 Consider an undirected simple graph. Let 3-node star be the motif of interest, and Suppose incident observation and To be able to identify whether it is the motif of interest, all the three pairs (i j), (ih) and ( jh) need to be in s 2 ; accordingly, An example where this is not the case is i ∈ s 1 and j, h ∈ α(s 1 )\s 1 , so that the observed part of this triad is a star, but one cannot be sure if a jh = 0 in the population graph, because ( jh) / ∈ s 2 .
Symmetric designs The inclusion probability π (M k ) depends on the sampling design of initial s 1 . At various places, Frank consider simple random sampling (SRS) without replacement, Bernoulli sampling and Poisson sampling for selecting the initial sample. In particular, a sampling design is symmetric [6] if the inclusion probability π M k = Pr(M k ∈ s 1 ) only depends on k but is a constant of M k , for all 1 ≤ k ≤ N . SRS with or without replacement and Bernoulli sampling are all symmetric designs. SRS without replacement is the only symmetric design with fixed sample size of distinct elements.
Approximate approach The initial inclusion probability π M k has a simpler expression under Bernoulli sampling than under an SRS design. Provided negligible sampling fraction of s 1 , many authors use Bernoulli sampling with probability p = |s 1 |/N to approximate any symmetric designs. Similarly, initial unequal probability sampling may be approximated by Poisson sampling with the same π i , for i ∈ U , provided negligible sampling fraction |s 1 |/N . Finally, Monte Carlo simulation [4] may be used to approximate the relevant π M k under sampling without replacement.

T -stage snowball sampling
An incident observation procedure (Sect. 2.3) provides the means to enlarge a set of sample nodes by their out-of-sample adjacent nodes. It yields a method of 1-stage snowball sampling, which can be extended successively to yield the T -stage snowball sampling. Below we assume that all the successors are included in the sample. But it is possible to take only some of the successors at each stage (e.g. [23]). In particular, taking one successor each time yields a Tstage walk (e.g. [18]). Two different observation procedures will be considered, i.e. incident forward in digraphs and incident reciprocal in directed or undirected graphs. We develop general formulae for inclusion probabilities under T -stage snowball sampling. It is shown that additional observation features are necessary for the HT-estimator based on T -stage snowball sampling, which will be referred to as incident ancestral. Previously, Goodman [17] has studied the estimation of mutual relationships between i and j, where a i j a ji > 0 for i = j ∈ U , based on T -stage snowball sampling in a special digraph with fixed a i+ ≡ 1; Frank [8] and Frank and Snijders [15] considered explicitly HT-estimation based on 1-stage snowball sampling.
Sample graph G s = (U s , A s ) Let s 1,0 be the initial sample of seeds, and α(s 1,0 ) its successors. Let U 0 ⊆ U be the set of possible initial sample nodes. The additional nodes s 1,1 = α(s 1,0 )\s 1,0 are called the fist-wave snowball sample, which are the seeds of the second-wave snowball sample, and so on. At the tth stage, let s 1,t = α(s 1,t−1 )\ t−1 h=0 s 1,h be the tth stage seeds, for t = 1, 2, . . . , T . If s 1,t = ∅, set s 1,t+1 = · · · = s 1,T = ∅ and terminate, otherwise move to stage t + 1. Let s 1 = T −1 t=0 s 1,t be the sample of seeds. This may result in two different sample graphs.
I Let s 2 = s 1 × U provided incident forward observation in digraphs, such that the sample graph G s has edge set A s = i∈s 1 j∈α i A i j and node set U s = s 1 ∪ α(s 1 ). II Let s 2 = s 1 × U ∪ U × s 1 provided incident reciprocal observation, digraphs or not, such that G s has edge set A s = i∈s 1 j∈α i (A i j ∪ A ji ) and node set U s = s 1 ∪ α(s 1 ).
Remark One may disregard any loops in snowball sampling, because they do not affect the propagation of the waves of nodes, but only cause complications to their definition.

Inclusion probabilities of nodes and edges in G s
Below we develop the inclusion probabilities π (i) and π (i)( j) of nodes in U s , and π (i j) and π (i j)(hl) of edges in A s , under T -stage snowball sampling with s 2 as specified above.

Forward observation in digraphs
The stage-specific seed samples s 1,0 , . . . , s 1,T −1 are disjoint, so that each observed edge, denoted by i j ∈ A s , can only be included at a particular stage.
i be its tth generation predecessors, for t > 0, which consists of the nodes that would lead to i in t-stages from s 1,0 but not sooner. Notice that β [0] i , β [1] i , β [2] i , . . . are disjoint. We have The respective joint inclusion probabilities follow as Incident reciprocal observation Each i j ∈ A s can only be included at a particular stage, where either i or j is in the seed sample, regardless if the graph is directed or not. For i ∈ U , let η i = { j ∈ U ; a i j + a ji > 0} be the set of its adjacent nodes. Let i be its tth step neighbours, for t > 0, which are the nodes that would lead to i in t-stages from s 1,0 but not sooner. We have The respective joint inclusion probabilities follow as π (i)( j) = 1 −π R i −π R j +π R i ∪R j and π (i j)(hl) = 1 −π R i j −π R hl +π R i j ∪R hl . Incident ancestral observation procedure It is thus clear that additional features of the observation procedure is required in order to calculate π (i) and π (i)( j) given any T ≥ 1, or π (i j) and π (i j)(hl) given any T ≥ 2. Reciprocal or not, an incident procedure is said to be ancestral in addition, if one is able to observe all the nodes that would lead to the inclusion of a node i ∈ U s , which will be referred to as its ancestors. These are the predecessors of various generations for forward observation in digraphs, or the neighbours of various steps for reciprocal observation in directed or undirected graphs. Notice that the edges connecting the sample nodes in U s and their out-of-sample ancestors are not included in the sample graph G s . More comments regarding the connections between snowball sampling and some well-known network sampling methods will be given in Sect. 5.
Remark Frank [5] defines the reach at i as the order of the connected component containing node i. The requirement of observing the reach, without including the whole connected component in the sample graph, is similar to that of an ancestral observation procedure, even though the two are clearly different. (3) and (4)

Example 4 To illustrate the inclusion probabilities
where Pr M k ⊆ s 1 = π (i 1 )(i 2 )···(i k ) is joint inclusion probability of the relevant nodes in s 1 , similarly for Pr M  for all i ∈ M k \M k and j ∈ M k \M k . The joint inclusion probability π (M k )(M k ) follows, similarly as above for π (M k ) , as the probability that at least one of these subsets is in the sample of seeds s 1 .
Probability π (i 1 )(i 2 )···(i k ) In the case of k = 2, π (i)( j) is as given earlier in Sect. 4.1. To express π (i 1 )(i 2 )···(i k ) in terms of the probabilities for the initial seed sample s 1,0 , we have where L includes ∅, and |L| is its cardinality, andπ(L) is the exclusion probabilitȳ where R L = i∈L R i and R i = T −1 t=0 η [t] i is the ancestors of i up to the T − 1 steps, and π D is joint inclusion probability of the nodes in D in the initial sample of seeds s 1,0 .

Arbitrary M k with k ≥ 2 and s
By dropping the nodes s 1,T of the last wave of T -stage snowball sampling, we ensure that the motif of any subset M k ∈ s 1 is observable. The idea is developed below.
Definition of π (M k ) for M k ⊆ s 1 Let G s = (U s , A s ) be the sample graph of T -stage snowball sampling, with reference set be the reduced sample graph obtained from dropping s 1,T , with reference set s * 2 = s 1 × s 1 , where A * s = A s \{ i j ; i ∈ s 1 , j ∈ s 1,T } and U * s = U s \s 1,T = s 1 . Notice that A * s contains all the edges between any i, j ∈ s 1 in the population graph G, and G * s is the same sample graph that is obtained from s 1 by induced observation directly. It follows that one observes the motif for any M k ∈ s 1 , so that the inclusion probability π (M k ) is given by where π (i 1 )(i 2 )···(i k ) is given by (6) and (7) as before.  (8), (6) and (7).
Other reduced graphs The sample graph G * s is obtained from dropping the T th wave nodes s 1,T . Rewrite G * s as G Comparisons between G * s and G s On the one hand, whichever motif of interest, G s always has a larger or equal number of observations than G * s . Hence, one may expect a loss of efficiency with G * s . On the other hand, estimation based on G s requires more computation than G * s . Firstly, for any M k ⊆ s 1 , it requires about k times extra computation for π (M k ) by (5) than by (8). This is due to the need to compute the probability of possibly observing M k   Table 1 lists 6 selected triad (M 3 ) inclusion probabilities given by (5) and (8), respectively, with respect to s 2 = s 1 ×U and s * 2 = s 1 × s 1 . These are seen to be equal to the true probabilities calculated directly over all possible initial samples s 1,0 , under SRS of sample size 3. Table 2 shows the estimates of the four 3rd-order graph totalsθ 3,h , for h = 0, 1, 2, 3, which are as defined in Sect. 3.1.3, based on these two sample graphs G s and G * s . The expectation and standard error of each estimators are also given in Table 2, which can be evaluated directly over all the possible initial sample s 1,0 . The true totals in the population graph G are (θ 3,0 , θ 3,1 , θ 3,2 , θ 3,3 ) = (121, 123, 40, 2). Clearly, both HT-estimators are unbiased, and using G * s entails a loss of efficiency against G s , as commented earlier.

Proportional representative sampling in graphs
A traditional sampling method is sometimes said to be (proportional) representative if the sample distribution of the survey values of interest is an unbiased estimator of the population distribution directly. This is the case provided equal probability selection. Equipped with the general formulae for π (M k ) under T -stage snowball sampling, below we propose and examine a proportional representativeness concept for graph sampling.
Graph proportional representativeness Let m k = m k be two distinct motifs of the order k. A graph sampling method is kth order proportionally representative (PR k ) if where θ is the number of m k in the population graph G, and θ s that of the observed m k in the sample graph G s with reference set s 2 , and similarly with θ and θ s for m k . Result 2. One-stage snowball sampling is P R k for k ≥ 2, provided s 2 = s 1 × U ∪ U × s 1 and symmetric design p(s 1 ). Suppose first reciprocal observation. We have R i = {i} ∪ η [1] i , whose cardinality varies for different nodes in G. It follows that π (M 1 ) = π (i) by (3) is not a constant over U , i.e. the design is not PR 1 . Next, for M k with k ≥ 2, π (M k ) by (5) depends on k + 1 probabilities given by (6) and (7). Each relevant probabilityπ(L) is only a function of |R L | provided symmetric design p(s 1 ), where R L = i∈L R i = L since R i = {i} given T = 1. It follows that |R L | = |L| regardless of the nodes in M k , such that π (M k ) is a constant of M k , i.e. PR k . Similarly for forward observation in digraphs.
Remark Setting s * 2 = s 1 × s 1 yields induced sample graph from s 1 and Result 1.
Result 3. T -stage snowball sampling is generally not P R k for k ≥ 1 and T ≥ 2, despite symmetric design p(s 1 ). As under 1-stage snowball sampling, the design is not PR 1 . Whether by (5) or (8) for k ≥ 2, π (M k ) depends onπ(L) in (6), which is only a function of |R L | provided symmetric design p(s 1 ). However, given T ≥ 2 and |L|, R L = i∈L R i generally varies for different L, so that neither R L nor |R L | is a constant of the nodes in M k , i.e. the design is not PR k . Similarly for forward observation in digraphs.

Network sampling methods
As prominent examples from the network sampling literature we consider here multiplicity sampling [1], indirect sampling [19] and adaptive cluster sampling [24]. Below we first summarise broadly their characteristics in terms of target parameter, sampling and estimator, and then discuss four salient applications of these methods using the snowball sampling theory developed in Sect. 4.
Target parameter In all the network sampling methods mentioned above, the target parameter is the total of a value associated with each node of the graph, denoted by y i for i ∈ U , which can be referred to as a 1st-order network total θ = i∈U y i in light of (1). This does not differ from that when "conventional" sampling methods are applied for the same purpose, where Sirken [22] uses the term conventional in contrast to network. In other words, these network sampling methods have so far only been applied to overcome either certain deficiency of frame or lack of efficiency of the traditional sampling methods, as discussed below in terms of sampling and estimator, but not in order to study genuine network totals or parameters, which are of orders higher than one.
Sampling Like in the definition of sample graph, these network sampling methods start with an initial sample s 1 . The sampling frame of s 1 can be direct or indirect. In the latter case, the sampling units are not the population elements. This may be necessary because a frame of the population elements is unavailable, such as when siblings are identified by following up kins to the household members of an initial sample of households [22]. Or, a frame of the elements may be available but is unethical to use, such as when children are accessed via a sample of parents [19]. In cases a direct frame of elements is used, the initial sample s 1 may be inefficient due to the low prevalence of in-scope target population elements, so that an observation procedure depending on the network relationship (between the elements) is used to increase the effective sample size. This is the case with adaptive cluster sampling (Thompson, 1989).
Estimator For 1-st order network parameters (1), where the population elements are represented as nodes in the population graph G = (U, A), the HT-estimator (2) is defined for the observed nodes in the sample graph G s = (U s , A s ). Another approach in the aforementioned methods is the HT-estimator defined for the selected sampling units. Let F be the frame of sampling units, where l ∈ F has inclusion probability π l . We have l∈F z l = l∈F i∈U where z l = i∈U w li y i is a value constructed for the sampling units, based on any chosen weights, provided k∈F w ki = 1, as noted by Birnbaum and Sirken [1]. The corresponding HT-estimator that is unbiased for θ can be given bỹ where δ l = 1 if l ∈ s 1 and 0 otherwise. To ensure that z l can be calculated no matter which actual sample s 1 , the weights w li must not depend on s 1 . A common approach is to set w li = 1/m i , where l a sampling unit in s 1 which gives rise to i, and m i is the number of all sampling units in F that could lead to the observation of i, for i ∈ U . The number m i is referred to as the multiplicity of the element [1]. The observation of m i for each sample element is the same kind of requirement as the observation of the ancestors of a node in U s under snowball sampling. The literature is inconclusive on the relative efficiency between the two estimators (2) and (10).

Sampling patients via hospitals
Birnbaum and Sirken [1] consider this situation, without using graph representation. To fix the idea, suppose a sample of hospitals is selected according to a probability design. From each sample hospital, one observes a number of patients of a given type, who are treated at this hospital. Let the target parameter θ be the population size of such patients. The complication arises from the fact that a patient may receive treatment at more than one hospital. Sirken [22] refers to conventional sampling where every population element is linked to one and only one sampling unit, whereas in the case of network sampling a population element (i.e. patient of a certain type) can be linked to a varying number of sampling units (i.e. hospitals).
Sirken [22] refers to 'cluster' as the group of population elements which are linked to the same sampling unit, and 'network' the group of sampling units which are linked to the same population element. The distinction between cluster and network here needs to be accounted for in estimation. The observation procedure must be incident ancestral, so that m i is observed for i ∈ α(s 1 ), without including in the sample graph G s all the edges that are incident at i but outside of s 2 . The inclusion probability π (i) is given by (3), where we have η [0] i = ∅ since U 0 ∩ P = ∅, and η [1] i = β i , so that R i = β i and |R i | = m i . Let y i = 1 for all i ∈ P.
Remark The HT-estimator (2) and (10) correspond to the first two estimators proposed by Birnbaum and Sirken [1]. Their third estimator is defined for the edges in the projection graph, which however lacks a formulation that allows it to be applied generally.
Two-stage snowball sampling Consider 2-stage snowball sampling in the same graph, under which the observation procedure is incident but needs not be ancestral in addition. Given s 1,0 ⊂ H , let s 1,1 = α(s 1,0 ) ⊆ P and s 1,2 = α(s 1,1 ) ⊆ H , i.e. reverse projection. The HT-estimator (2) makes only use of the nodes (i.e. motif of interest) in s 1,1 , where y i ≡ 1, and π (i) is given by (3), for which R i = β i is fully observed due to the addition of s 1,2 .

Sampling children via parents
Lavalleè [19] considers this situation. Children are the population elements. Suppose a sample of parents is selected according to a probability design. One obtains all the children of each sample parent. Without losing generality, let the target parameter θ be the number of children who are not orphans. The same complication arises from the fact that a child may be accessed via two parents if they are both in the sampling frame. Clearly, the situation is conceptually the same as sampling patients via hospitals above.
Remark Lavalleè [19] represents the situation using the same graph (P) above, where U = P ∪ C, and P consists of the parents and C the children. The HT-estimator (2) based on either 1-or 2-stage snowball sampling formulation is the same as above, with y i ≡ 1 for i ∈ C. Lavalleè [19] considers only the HT-estimator (10). Remark Making population elements the edges of the graph is not convenient for the hospitalpatient application, because while each child corresponds to only one edge, each patient may appear as multiple edges incident to different nodes (i.e. hospitals).

Sampling siblings via households
Sirken [22] discusses this situation, without resorting to explicit graph representation. To fix the idea, suppose a sample of households is selected according to a probability design. For each member of the household, one obtains all the siblings who may or may not live in the same household. The observation elements are siblings, denoted by S, which excludes anyone who has no siblings. Without losing generality, let θ be the number of siblings.
(2P) Twice projection graph Denote by H the households, P the persons, and S the siblings, where i ∈ S is considered a different element to j ∈ P, even if i and j refer to the same person in real life. Let G = (U, A), where U = H ∪ P ∪ S and A = A H P ∪ A P S . Each A hj ⊂ A H P is such that h ∈ H and j ∈ P, i.e. A H P projects H onto P; each A i j ∈ A P S is such that i ∈ P and j ∈ S are siblings, including when the two refer to the same person, i.e. A P S projects P onto S. Let the twice projection graph from H to P to S be undirected. Consider 2-stage snowball sampling starting from s 1,0 ⊂ H = U 0 . Let s 2 = s 1 × U , where s 1 = s 1,0 ∪ s 1,1 is the sample of seeds. The observation procedure must be incident ancestral, provided which the HT-estimator (2) is only based on s 1,2 . For i ∈ S, we have y i = 1 and π (i) given by (3), where η [0] i = η [1] i = 0 because it can only be reached from s 1,0 in exactly two waves, and η i = η [2] i where |η i | = m i is the number of households that can lead to i from s 1,0 , i.e. its multiplicity according to Birnbaum and Sirken [1].
person j belongs to household i, or (ii) persons i and j are siblings of each other. The edges of type (i) project H on to P, whereas those of type (ii) are relations within P. Notice that each group of siblings form a clique; a person without siblings is a single-node clique. To ensure ancestral observation, consider 3-stage snowball sampling. Given s 1,0 ⊂ H = U 0 , s 1,1 consists of the members of the households in s 1,0 , and s 1,2 the siblings of s 1,1 which are outside of the initial sample households, and s 1,3 ⊆ H consists of the households to s 1,2 . Let s 2 = s 1 ×U , where s 1 = s 1,0 ∪ s 1,1 ∪ s 1,2 . The HT-estimator (2) makes use of i ∈ s 1 ∩ S, with y i ≡ 1. The corresponding π (i) is given by (3), where η [0] i = 0, and η [1] i is the household of i, and η [2] i contains the households of its out-of-household siblings. In other words, η i contains all the households that can lead to i, where |η i | = m i .
Remark Sampling in the graphs (2P) and (PR) makes use of relationships among the population elements, unlike sampling of patients or children in the projection graph (P).
(HP) Hypernode projection graph Let each clique in the graph (PR) above be a hypernodeall the nodes of a hypernode are always observed together or not at all. Let G = (U, A), where U = H ∪ P, and P consists of all the hypernodes of P. Let a i j = 1 iff at least one node in the hypernode j belongs to household i. This yields an undirected simple graph as the hypernode projection graph. Consider 2-stage snowball sampling with U 0 = H as in the projection graph, such that observation is ancestral by construction. Both HT-estimators (2) and (10) follow directly, where y i is the number of nodes in i ∈ P.

Adaptive cluster sampling of rare species
In contrast to conventional sampling, Thompson [24] characterises adaptive sampling designs as those in which the procedure to include units in the sample depends on the values of interest observed during the survey. To fix the idea, suppose an area is divided into (spatial) grids as the units of sampling and observation. Each grid in an initial sample of grids is surveyed for a given species of interest. If it is not found there, one would move on to another grid in the initial sample. However, whenever the species is found in grid i, one would survey each of its neighbour grids in four directions, beyond the initial sample, provided not all of them have been surveyed before. This observation procedure can help to increase the number of in-scope grids, compared to random sampling of the same amount of grids, provided the species is more likely to be found given that it is found in a neighbour grid than otherwise. Once in a new grid, the procedure is repeated and the survey may or may not continue to the neighbour grids, depending on the finding in the current grid. The sampling is finished if no new grids can be added to the sample, or if one has reached a predetermined limit in terms of the number of surveyed grids, time, resource, etc. The observed in-scope grids form sampling as well as observation clusters, in the sense that all the member grids of a cluster are sampled and observed if any one of them is.
(T) Transitive graph Adaptive cluster sampling (ACS) can be represented as 2-stage snowball sampling in a transitive graph as follows. Let G = (U, A), where U contains all the grids in ACS. Let U A contain all the grids where the rare species is present. Let U c A = U \U A . Let a i j = 1 iff i, j ∈ U A and i and j belong to the same observation cluster under the ACS. This yields an undirected simple transitive graph, where each i ∈ U c A is an isolated node, and each group of connected nodes in U A form a clique. Without losing generality, let θ = |U A |. The snowball sampling starts with s 1,0 ⊂ U = U 0 , i.e. any grid can be selected initially. Let s 1,1 = α(s 1,0 ). Notice that the isolated nodes in s 1,0 do not lead to any nodes in s 1,1 , while a connected node in s 1,0 leads to all the nodes in the same observation cluster but none in U c A , since edges exist only among the nodes in U A . In reality, a neighbour grid of i ∈ U A ∩s 1,0 which belongs to U c A is also surveyed, but it will not lead to any additional nodes in the next wave, nor will it be the motif of interest in estimation. It is therefore convenient to represent this adaptive nature of the ACS by not including in s 1,1 any node from U c A at all. The 2nd-wave snowball sample will be empty, i.e. s 1,2 = ∅, because all the connected nodes in a clique will already be observed either in s 1,0 or s 1,1 . But the 2nd-stage is needed to ensure that the observation is ancestral by construction. The HT-estimator (2) uses every node i ∈ s 1 = s 1,0 ∪ s 1,1 , with y i = 1, and π (i) is given by (3), where η [0] i = {i}, and η [1] i contains all its adjacent nodes.

Remark
The graph (T) is the same as the relation part of the graph (PR) in the case of sampling siblings via households. The projection part is not necessary here because the initial sampling uses a direct frame, unlike the other applications above.
Remark The ACS can as well be represented by the graph (HP), with the cliques in the graph (T) above as the hypernodes. Both HT-estimators (2) and (10) follow directly.

Concluding remarks
In this paper we synthesised the existing graph sampling theory, and made several extensions of our own. We proposed a definition of sample graph, to replace the different samples of nodes, dyads, triads, etc. This provides formally an analogy between sample graph as a subpopulation graph and sample as a sub-population. Next, we developed a general approach of HT-estimation based on arbitrary T -stage snowball sampling. It is clarified that design-based estimation based on snowball sampling requires the observation procedure to be ancestral, which can be hard to fulfil in many practical applications of snowball or snowball-like sampling, including the estimation of a clandestine target population size. Without satisfying the ancestral requirement, the estimation will have to be based on an appropriate statistical model instead.
We presented various graph sampling formulations of the existing design-based network sampling methods. It is seen that different graph representations reveal the different estimators more or less readily, so the choice matters in applications. The graph sampling theory provides a more general and flexible framework to study and compare these unconventional methods, and to develop possible alternatives and modifications.
Moreover, it transpires that these existing network sampling methods do not really differ from conventional sampling with respect to the target parameter. We believe that the scope of investigation can be greatly extended if one starts to consider other genuine network parameters, which can only be studied using a graph representation. Two research directions can be identified in this respect. First, we are currently examining the scope of problems that can be studied using the (hypernode) projection graph, and the properties of the design-based estimation methods. Second, it seems intuitive that a lower-order network parameter can be estimated using a 'smaller' or more fragmented sample graph than a higher-order parameter. It is therefore interesting to understand better the conditions, by which a high-order network parameter can be expressed as a function of lower-order parameters. Although this is perhaps more of a mathematical than statistical problem, such transformations can potentially be very useful for the applications of the graph sampling theory. Developing a comprehensive finitegraph sampling theory, beyond the established finite-population sampling theory, seems an exciting area for future research.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.