The Influence of Clustering Population on Estimation Accuracy of Population Totals Vector

The estimation accuracy of a population totals vector based on a simple cluster sample is considered. The variance-covariance matrix of the estimators depends on the intra-cluster spread of the variables under study. This spread depends on the partition of the population into clusters. The variance-covariance matrix is evaluated under several variants of clustering algorithms. This lets us find the clustering algorithm providing the most accurate estimation of the population totals vector.


Introduction
Research into practical survey sampling is usually based on vector parameters. The purpose of this paper is to simultaneously estimate the population totals of at least two variables. The well-known vector estimator from a simple cluster sample drawn without replacement is considered. Its accuracy is compared with the ordinary vector estimator from a simple random sample drawn without replacement using the variance-covariance matrix. We analyse the accuracy of vector estimators using the methodology proposed by Borovkov [2], Jensen [4], Rao [7] and the generalized relative efficiency coefficient proposed by Rao [8]. This coefficient is defined as the maximal eigenvalue of the product of two matrices. In our case, one of the matrices is the variance-covariance matrix of the vector estimator from the cluster sample and the other is the inverse of the variance-covariance matrix of the vector estimator from the simple random sample. Let us add that Wywial [14] analysed the accuracies of estimation procedures for the total based on samples selected according to several sampling designs from a population clustered using several algorithms. This paper is in some sense a generalization of those considerations to the simultaneous estimation of the totals of at least two variables under study.
The accuracy of the vector estimation of totals based on the cluster sample depends on how the population is partitioned into clusters. The influence of this partitioning on the accuracy of the estimation is considered. The problem is analysed by means of several measures of vector estimation accuracy.
The obtained results should be useful in survey sampling conducted, e.g. by statistical offices. In this case, census data are usually available. Variables under study (observed during a census) can be used as auxiliary variables in a survey sampling on a subsequent occasion. In this situation, appropriate clustering of the population considered in this paper should contribute to improving future sampling strategies.
One of the aspects of big-data analysis is the problem of reducing the number of data (observations of variables). The results of this paper partially contribute to solving this problem because the considered methods provide partitions of a population into clusters each of which is as similar to the population as possible. More precisely, the spreads of the data observations in the clusters are not less than the spread of the data in the population.
The main results of this paper are as follows:
• the variance-covariance matrix of the vector estimator of totals from the simple cluster sample is shown to be a function of the matrix of homogeneity coefficients, Sect. 2.2 and "Appendix",
• properties of the homogeneity matrix let us show when the vector estimator from the cluster sample is more accurate than the vector estimator from the simple sample, Sect. 2.3 and "Appendix",
• several algorithms for partitioning a population into mutually disjoint and nonempty clusters are proposed, Sect. 2.4,
• these algorithms are used to partition the population of Swedish municipalities into clusters, Sect. 3,
• for these partitions, values of the generalized coefficient of the relative efficiency of the vector estimator from the cluster sample are evaluated, which lets us analyse the influence of the partition of the population on the estimation accuracy, Sect. 3.

Basic Notations
Let U be a population of size N, partitioned into G mutually disjoint and nonempty clusters U_h of sizes N_h, h = 1, ..., G, where N_1 + ... + N_G = N. On each unit k ∈ U, m variables are observed, y_k = [y_{k,1} ... y_{k,m}]. Let ȳ be the vector of population means, y_U = Nȳ the vector of totals, C the matrix of variances and covariances, R the matrix of correlation coefficients and D the diagonal matrix of variances. Moreover, for k ∈ U, i = 1, ..., m, j = 1, ..., m and h = 1, ..., G, let ȳ_{U_h,i} be the mean of the i-th variable in the h-th cluster, y_{U_h,i} = N_h ȳ_{U_h,i} the total of the i-th variable in the h-th cluster, C_h the variance-covariance matrix in the h-th cluster, ȳ_U the vector of the means of the cluster totals and C_U the variance-covariance matrix of the cluster totals.
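As an illustration, the notation above can be sketched in code. The population data here are hypothetical, and the divisors follow the usual sample-covariance convention (N − 1 for C, G − 1 for C_U), which is an assumption about the paper's exact definitions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: N = 12 units, m = 2 variables.
N, m = 12, 2
y = rng.normal(size=(N, m))

y_bar = y.mean(axis=0)            # vector of population means
y_U = N * y_bar                   # vector of population totals
C = np.cov(y, rowvar=False)       # variance-covariance matrix
R = np.corrcoef(y, rowvar=False)  # matrix of correlation coefficients
D = np.diag(np.diag(C))           # diagonal matrix of variances

# A partition into G = 3 clusters of equal size M = 4.
G, M = 3, 4
clusters = [list(range(h * M, (h + 1) * M)) for h in range(G)]

cluster_totals = np.array([y[idx].sum(axis=0) for idx in clusters])
C_U = np.cov(cluster_totals, rowvar=False)  # covariance of cluster totals
```

With these objects in place, the matrices C and C_U are all that the efficiency comparisons below require.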

Simple Cluster Sampling
Let sample s be drawn from population U or from its partition into clusters. The random sample will be denoted by the capital letter S, while its observation by s. Cluster sample s is defined as a g-element set of clusters U_h drawn from the partition. The well-known simple cluster sampling design assigns the same probability to every such sample: P(s) = \binom{G}{g}^{-1}, where s ∈ S_U and S_U is the sampling space generated for the set of clusters.
The well-known unbiased estimator of the vector of population totals y_U is as follows: ỹ_S = (G/g) Σ_{h∈S} y_{U_h}. Its variance-covariance matrix is V(ỹ_S) = G²(1/g − 1/G) C_U, where ỹ_S is evaluated on the basis of the simple cluster sample drawn without replacement. Generalizing the results of [9, pp. 129-133] to the multidimensional case, we derive in the "Appendix" an expression for V(ỹ_S) as a function of the matrix Δ (see also [11,12]). Parameter Δ is the matrix of the coefficients of intra-cluster data spread homogeneity, or simply the homogeneity matrix. The intra-cluster variance-covariance matrix is denoted by C*. Let us underline that when N_h = M for all h = 1, ..., G, then A = O. Sarndal et al. [9] proved that all diagonal elements of Δ take values from the interval [−(G−1)/(N−G); 1]. Let λ be an eigenvalue of Δ. In the last part of the "Appendix", the following inequality is proved: −(G−1)/(N−G) ≤ λ ≤ 1. Kish [5] provided sound advice on grouping problems that might be encountered in practical surveys.
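The estimator and its unbiasedness can be illustrated as follows. The population, the partition into G = 4 clusters and the formula ỹ_S = (G/g) Σ_{h∈S} y_{U_h} are a sketch under the standard simple-cluster-sampling setup, with hypothetical data:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)

# Hypothetical population of N = 12 units, m = 2 variables,
# partitioned into G = 4 clusters of size M = 3; sample g = 2 clusters.
N, m, G, M, g = 12, 2, 4, 3, 2
y = rng.normal(size=(N, m))
cluster_totals = np.array([y[h * M:(h + 1) * M].sum(axis=0) for h in range(G)])
y_U = y.sum(axis=0)

def t_cluster(sample):
    """Unbiased estimator of y_U from a simple cluster sample of g clusters."""
    return (G / g) * cluster_totals[list(sample)].sum(axis=0)

# Averaging over all C(G, g) equally likely samples recovers the
# population totals vector, which demonstrates unbiasedness.
estimates = [t_cluster(s) for s in combinations(range(G), g)]
assert np.allclose(np.mean(estimates, axis=0), y_U)
```

The empirical variance-covariance matrix of these estimates over all samples equals V(ỹ_S) up to the choice of divisor.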

Relative Efficiency
Let t_{1S} and t_{2S} be unbiased estimators of the vector parameter θ ∈ Θ. Borovkov [2] proposed comparing the accuracy of vector estimators using the following definition (see also [7] or [12], pp. 28-29): Definition 1 Estimator t_{1S} is not worse than t_{2S} if and only if the matrix V(t_{2S}) − V(t_{1S}) is non-negative definite. Estimator t_{1S} is better than t_{2S} if and only if t_{1S} is not worse than t_{2S} and the above inequality becomes sharp for at least one fixed parameter θ. This definition directly leads to the following, see [7] and Borovkov [2]: tr V(t_{1S}) ≤ tr V(t_{2S}), det V(t_{1S}) ≤ det V(t_{2S}) and λ(V(t_{1S})) ≤ λ(V(t_{2S})), where tr V(t_{iS}), det V(t_{iS}) and λ(V(t_{iS})) are called the mean square radius, the generalized variance and the spectral radius (maximal eigenvalue) of V(t_{iS}), respectively. The accuracy of estimator ỹ_S is compared with the accuracy of the following well-known estimator of the vector of totals from an ordinary simple random sample drawn without replacement from the whole population: y_S = (N/n) Σ_{k∈S} y_k, with V(y_S) = N²(1/n − 1/N) C, where S is drawn without replacement according to the sampling design P(s) = \binom{N}{n}^{-1}, s ∈ S and S is the sampling space generated for U. Under the assumption that n = gM = gN/G, we have: V(y_S) = NG(1/g − 1/G) C. According to Theorem 1, the estimator ỹ_S is not worse than y_S when GC_U − NC is non-positive definite. In particular, when N_h = M for all h = 1, ..., G, expressions (4), (7) and (8) simplify to expressions (9). The following theorem is proved in the "Appendix". When the clusters are of the same size, Theorem 1 and expressions (9) let us conclude that when the matrix Δ is non-positive (non-negative) definite, then estimator ỹ_S is not worse (not better) than y_S.
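Definition 1 reduces to an eigenvalue check on the difference of the two variance-covariance matrices. A minimal sketch, where the matrices V1 and V2 are made up for illustration:

```python
import numpy as np

def not_worse(V1, V2, tol=1e-12):
    """Definition 1: the estimator with matrix V1 is not worse than the one
    with matrix V2 iff V2 - V1 is non-negative definite, i.e. all
    eigenvalues of the symmetric difference are >= 0."""
    diff = V2 - V1
    return bool(np.all(np.linalg.eigvalsh(diff) >= -tol))

# Hypothetical variance-covariance matrices of two unbiased vector estimators.
V1 = np.array([[2.0, 0.5], [0.5, 1.0]])
V2 = np.array([[3.0, 0.5], [0.5, 2.0]])

assert not_worse(V1, V2)      # V2 - V1 = [[1, 0], [0, 1]] is positive definite
assert not not_worse(V2, V1)  # the reverse comparison fails
```

Note that `eigvalsh` is appropriate here because a difference of variance-covariance matrices is symmetric.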
Rao and Scott [8, pp. 223] define the generalized relative efficiency coefficient as follows: deff(ỹ_S) = λ(V(y_S)^{-1} V(ỹ_S)), where V(y_S) is non-singular. When n = gM, expressions (2), (3) and (10) lead to the following: deff(ỹ_S) = (G/N) λ(C^{-1} C_U). Hence, deff(ỹ_S) is minimal when the population is partitioned into the set of clusters in such a way that λ(C^{-1} C_U) is minimal.
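A sketch of evaluating deff(ỹ_S) numerically, assuming the standard SRSWOR and simple-cluster-sampling variance formulas and sample-covariance divisors (N − 1 and G − 1); the data are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical population: N = 20 units, m = 2 variables,
# G = 5 clusters of size M = 4; sample g = 2 clusters.
N, m, G, M, g = 20, 2, 5, 4, 2
n = g * M  # equal workload: SRS of size n versus g clusters of M units
y = rng.normal(size=(N, m))
cluster_totals = np.array([y[h * M:(h + 1) * M].sum(axis=0) for h in range(G)])

C = np.cov(y, rowvar=False)                 # divisor N - 1
C_U = np.cov(cluster_totals, rowvar=False)  # divisor G - 1

V_srs = N**2 * (1 / n - 1 / N) * C          # variance of the SRSWOR estimator
V_cl = G**2 * (1 / g - 1 / G) * C_U         # variance of the cluster estimator

# Generalized relative efficiency: largest eigenvalue of V_srs^{-1} V_cl.
deff = np.max(np.linalg.eigvals(np.linalg.solve(V_srs, V_cl)).real)
```

Values of `deff` below 1 indicate that the cluster-sample estimator is more efficient than the SRS estimator for the given partition.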
In particular, expressions (3) and (4) show that when N_h = M for all h = 1, ..., G, then A = O. The inequality −(G−1)/(N−G) ≤ λ(Δ) ≤ 1 leads to the following: when λ(Δ) ≤ 0, then ỹ_S is more efficient than y_S. Hence, we should partition the population into clusters of the same size in such a way that the coefficient λ(Δ) takes its minimal negative value.

Clustering Algorithms
We can expect that variables observed in a finite and fixed population on a past occasion are highly correlated with the corresponding variables observed on the current or a future occasion. Therefore, census data can be used to construct a reasonable sampling design for a future occasion. The above considerations lead to the conclusion that the population should be clustered in such a way that the maximal eigenvalue of C^{-1} C_U takes the minimal value. Additionally, when we assume that the population has to be partitioned into clusters of the same size, then minimization of λ(Δ) is the criterion for population clustering. The following clustering algorithms will be considered: Systematic algorithm 1: Let us assume that y_k > 0 for all k = 1, ..., N. First, we evaluate the squared distances d_k = y_k y_k^T of y_k from the zero vector 0 for all k ∈ U. Next, we relabel the units so that d_k ≤ d_{k+1} for k = 1, ..., N − 1. The h-th cluster U_h consists of the units with labels k = (i − 1)G + h for i = 1, ..., M and h = 1, ..., G. The result of this clustering algorithm will be denoted by U_1. In some sense, this result is the well-known systematic simple sample space.
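Systematic algorithm 1 can be sketched as follows (hypothetical data; the assignment k = (i − 1)G + h is implemented with zero-based ranks):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical population: N = 12 units, m = 2 variables,
# to be split into G = 3 clusters of size M = 4.
N, m, G, M = 12, 2, 3, 4
y = np.abs(rng.normal(size=(N, m)))  # the algorithm assumes positive observations

# Squared distances of the observation vectors from the zero vector.
d = (y ** 2).sum(axis=1)
order = np.argsort(d)  # relabel units so that d is non-decreasing

# The unit with rank (i - 1) * G + h goes to cluster h: a systematic
# assignment that spreads small and large distances across all clusters.
clusters = [order[h::G].tolist() for h in range(G)]

assert sorted(sum(clusters, [])) == list(range(N))  # a proper partition
assert all(len(c) == M for c in clusters)           # equal cluster sizes
```

Because each cluster receives every G-th unit of the ordered list, each cluster covers the full range of distances, which keeps the intra-cluster spread high.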
the current partition, and we start iteration t + 2 of the algorithm. If λ_{t+1} ≥ λ_t, then we start stage t + 2 of the algorithm from partition U^{(t)}. The partitioning algorithm is stopped when the number of iterations reaches the assumed level T. This algorithm leads to the minimization of deff(ỹ_S). The population clustered according to this algorithm will be denoted by U_4.
Iteration algorithm 5: The clustering procedure described below is similar to the above one and also leads to minimization of V(ỹ S ).
Let U^{(t)} = {U_1^{(t)}, ..., U_G^{(t)}} be the partition of the population obtained as the result of the t-th iteration, where t = (l − 1)N + k, k = 1, ..., N, l = 1, 2, ..., and let λ_t = λ(C^{-1} C_{U^{(t)}}) be the maximal eigenvalue evaluated for this partition. The iterative clustering process is stopped when λ_{t+N} = λ_t or when the number of iterations reaches the assumed level T. This algorithm will be denoted by U_5.
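The iterative exchange idea behind algorithms 4 and 5 can be sketched as a pairwise-swap local search on λ(C^{-1} C_U). The stopping rule used here (a full pass with no improving swap) is a simplification of the paper's rule based on T iterations; the data are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical population: N = 12 units, m = 2 variables,
# G = 3 clusters of equal size M = 4.
N, m, G, M = 12, 2, 3, 4
y = rng.normal(size=(N, m))
C_inv = np.linalg.inv(np.cov(y, rowvar=False))

def criterion(labels):
    """Maximal eigenvalue of C^{-1} C_U for the partition given by labels."""
    totals = np.array([y[labels == h].sum(axis=0) for h in range(G)])
    C_U = np.cov(totals, rowvar=False)
    return np.max(np.linalg.eigvals(C_inv @ C_U).real)

labels = np.repeat(np.arange(G), M)  # initial partition of equal-size clusters
best = criterion(labels)

# Exchange units between clusters; keep a swap only if it lowers the criterion.
improved = True
while improved:
    improved = False
    for k in range(N):
        for l in range(k + 1, N):
            if labels[k] == labels[l]:
                continue
            labels[k], labels[l] = labels[l], labels[k]
            val = criterion(labels)
            if val < best:
                best, improved = val, True
            else:
                labels[k], labels[l] = labels[l], labels[k]  # undo the swap
```

Swapping pairs of units preserves the cluster sizes, so the equal-size constraint holds throughout the search.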
Analysis of Table 1 leads to the following conclusions. Only under clustering algorithms U_1 and U_2 is the accuracy of estimator y_S approximately not less than the accuracy of estimator ỹ_S for all considered combinations (M, g).
Partition U 4 leads to the most efficient estimation based on ỹ S . When we also assume that the population is split into sub-populations of the same sizes, estimator ỹ S based on the sample drawn from the population clustered according to algorithm U 3 is the most efficient.
For algorithms U_1 and U_2, the estimation efficiency based on ỹ_S decreases (or, equivalently, deff(ỹ_S) increases) when the number of clusters g decreases under the fixed sample size n. For algorithms U_3-U_6, the situation is reversed: under the fixed sample size n, the estimation efficiency based on ỹ_S increases when the number of clusters g decreases. For instance, under partition U_4, when (M, g) = (2, 14) and (M, g) = (14, 2), the accuracy of ỹ_S is almost two times and fifty times better than the accuracy of y_S, respectively.

Conclusions
In this paper, we have shown that it is possible to significantly increase the accuracy of estimating population totals using the vector estimator from a simple cluster sample drawn without replacement by considering a specific partition of a population into clusters. In the analysed empirical example, algorithms 5 and 6 lead to the optimal partition of the population. These algorithms should work quickly even when the population size is large. The results could be useful for panel or census survey sampling repeated on more than one occasion. The results of the paper could be applied to partitioning a population into clusters based on census data. This could improve the accuracy of estimating the population totals vector. Moreover, the results could be useful in some aspects of big-data analysis. This paper can also be treated as a contribution to the comparison of vector estimators. Several properties of the generalized relative efficiency coefficient are considered in Theorem 2. The generalized coefficient of intra-cluster data spread homogeneity was defined, its properties were considered and its values were interpreted. The generalized deff coefficient was also written as a function of the matrix of coefficients of intra-cluster homogeneity. The proposed procedures could be developed in several ways. Other clustering algorithms could be considered. In particular, the clustering procedures based on multivariate variables that are proposed in this paper could be reduced to one-dimensional cases. For instance, these variables could be replaced with their principal component. In this case, the several clustering procedures based on one-dimensional variables that were proposed in [14] could be adopted in our considerations.
In addition, many of the clustering algorithms available in the statistical literature (see, e.g. [1,6]) divide the population into homogeneous clusters. Typically, these procedures can be modified into algorithms that ensure the maximal spread of multivariate observations within each cluster. The well-known nearest (farthest) neighbour criteria seem suitable for this purpose. Properties of some sampling designs used in spatial statistics could inspire the construction of clustering algorithms. For example, the criteria considered by Thompson and Seber [10] or [13] can be adapted to divide a spatial population into clusters composed of non-neighbours.

Proof of Expression (6) About Eigenvalues of the Homogeneity Matrix
The eigenvalues of the homogeneity matrix Δ, given by (4), are the roots of the characteristic equation det(Δ − λI) = 0. According to the properties of matrices (see, e.g. [3], pp. 219), C = G^T G, where G is symmetric because the matrix C is symmetric and positive definite. Therefore, the characteristic equation can be transformed into an equation involving the matrix M = (G^{-1})^T C* G^{-1}. Matrix M is non-negative definite (see, e.g. [3], pp. 213). This lets us write 1 − λ ≥ 0, i.e. λ ≤ 1.