1 Introduction

Research into practical survey sampling is usually based on vector parameters. The purpose of this paper is to estimate simultaneously the population totals of at least two variables. The well-known vector estimator from a simple cluster sample drawn without replacement is considered. Its accuracy is compared with that of the ordinary vector estimator from a simple random sample drawn without replacement by means of the variance-covariance matrix. We analyse the accuracy of vector estimators using the methodology proposed by Borovkov [2], Jensen [4] and Rao [7], together with the generalized relative efficiency coefficient proposed by Rao [8]. This coefficient is defined as the maximal eigenvalue of the product of two matrices. In our case, one of the matrices is the variance-covariance matrix of the vector estimator from the cluster sample and the other is the inverse of the variance-covariance matrix of the vector estimator from the simple random sample. Let us add that Wywial [14] analysed the accuracy of estimation procedures for the total based on samples selected according to several sampling designs from a population clustered using several algorithms. This paper is, in some sense, a generalization of those considerations to the simultaneous estimation of the totals of at least two variables under study.

The accuracy of the vector estimation of totals based on the cluster sample depends on the partitioning of the population into clusters. The influence of this partitioning on the accuracy of the estimation is considered. The problem is analysed by means of several methods of measuring the accuracy of vector estimation.

The obtained results should be useful in survey sampling conducted, e.g., by statistical offices. In this case, census data are usually available. Variables under study (observed during a census) can be used as auxiliary variables in survey sampling on a subsequent occasion. In this situation, the appropriate clustering of the population considered in this paper should contribute to improving future sampling strategies.

One aspect of big-data analysis is the problem of reducing the number of data (observations of variables). The results of this paper partially contribute to solving this problem because the considered methods provide partitions of a population into clusters each of which is as similar to the population as possible. More precisely, the spreads of the data observations in the clusters are not less than the spread of the data in the population.

The main results of this paper are as follows:

  • the variance-covariance matrix of the vector estimator of totals from the simple cluster sample is expressed as a function of the matrix of homogeneity coefficients, Sect. 2.2 and “Appendix”,

  • properties of the homogeneity matrix allow us to show when the vector estimator from the cluster sample is more accurate than the vector estimator from the simple random sample, Sect. 2.3 and “Appendix”,

  • several algorithms for partitioning a population into mutually disjoint and non-empty clusters are proposed, Sect. 2.4,

  • these algorithms are used to partition the population of Swedish municipalities into clusters, Sect. 3,

  • for these partitions, the values of the generalized coefficient of the relative efficiency of the vector estimator from the cluster sample are evaluated, which lets us analyse the influence of the partition of the population on the accuracy of estimation, Sect. 3.

2 Estimation Based on Cluster Sample

2.1 Basic Notations

Let U be a population of size N. The number of variables observed in U is denoted by m. Observations of a vector variable will be denoted by \(\varvec{y}_k=[y_{k,1}...y_{k,m}]\) where \( k \in U \). Let us assume that the population is partitioned into disjoint sub-populations \(U_h\) of sizes \(N_h\), \(h=1,...,G\), called clusters. Hence, \(N=\sum _{h=1}^{G}N_h\) and \(\bar{N}=G^{-1}\sum _{h=1}^{G}N_h\). Let \(\bar{N}=M\) if all the clusters are of the same size. Let \({\mathcal{U}}=\{U_1,...,U_h,...,U_G\}\) be a partition of the population elements into clusters. Hence, \({\mathcal{U}}\) is the set of G mutually disjoint and non-empty clusters. Let

$$\begin{aligned}&\bar{\varvec{y}}=[\bar{y}_1...\bar{y}_m]=\sum _{k\in {U}}\varvec{y}_k/N,\quad \varvec{y}_U=N\bar{\varvec{y}}=\sum _{k\in {U}}\varvec{y}_k=[y_{U,1}...y_{U,m}], \\&y_{U,i}=\sum _{k\in {U}}y_{k,i}, \quad \varvec{C}=[c_{i,j}],\quad c_{i,j}=\sum _{k\in {U}}(y_{k,i}-\bar{y}_i)(y_{k,j}-\bar{y}_j)/(N-1), \\&\varvec{R}=\varvec{D}^{-1/2}\varvec{C}\varvec{D}^{-1/2}=[r_{i,j}],\quad \varvec{D}=[v_i],\quad r_{i,j}=\frac{c_{i,j}}{\sqrt{v_iv_j}},\quad v_i=c_{i,i} \end{aligned}$$

where \(\bar{\varvec{y}}\) is the vector of population means, \(\varvec{y}_U\) is the vector of totals, \(\varvec{C}\) is the matrix of variances and covariances, \(\varvec{R}\) is the matrix of correlation coefficients and \(\varvec{D}\) is the diagonal matrix of variances. Moreover, let

$$\begin{aligned}&\bar{\varvec{y}}_{U_h}=\sum _{k\in {U}_h}\varvec{y}_k/N_h, \quad \bar{\varvec{y}}_{U_h}=[\bar{y}_{U_h,1}...\bar{y}_{U_h,m}],\quad \bar{y}_{U_h,i}=\sum _{k\in {U}_h}y_{k,i}/N_h, \\&\varvec{y}_{U_h}=N_h\bar{\varvec{y}}_{U_h}=\sum _{k\in {U}_h}\varvec{y}_k=[y_{U_h,1}...y_{U_h,m}],\quad y_{U_h,i}=\sum _{k\in {U}_h}y_{k,i}, \\&\varvec{C}_{U_h}=[c_{U_h,i,j}],\quad c_{U_h,i,j}=\sum _{k\in {U}_h}(y_{k,i}-\bar{y}_{U_h,i})(y_{k,j}-\bar{y}_{U_h,j})/(N_h-1), \\&\bar{\varvec{y}}_{{\mathcal{U}}}=\sum _{h=1}^G\varvec{y}_{U_h}/G=\varvec{y}_U/G=[\bar{y}_{{\mathcal{U}},1}...\bar{y}_{{\mathcal{U}},m}],\;\; \bar{y}_{{\mathcal{U}},i}=\sum _{h=1}^Gy_{U_h,i}/G=y_{U,i}/G, \\&\varvec{y}_{{\mathcal{U}}}=G\bar{\varvec{y}}_{{\mathcal{U}}}=\sum _{h=1}^G\varvec{y}_{U_h}=\varvec{y}_U,\quad \varvec{C}_{\mathcal{U}}=[c_{{\mathcal{U}},i,j}], \\&c_{{\mathcal{U}},i,j}=\sum _{h=1}^G({y}_{U_h,i}-\bar{y}_{{\mathcal{U}},i})({y}_{U_h,j}-\bar{y}_{{\mathcal{U}},j})/(G-1) \end{aligned}$$

where \(k\in {U}\), \(i=1,...,m\), \(j=1,...,m\), \(h=1,...,G\). Here, \(\bar{y}_{U_h,i}\) is the mean of the i-th variable in the h-th cluster, \(y_{U_h,i}\) is the total of the i-th variable in the h-th cluster, \(\varvec{C}_{U_h}\) is the variance-covariance matrix in the h-th cluster, \(\bar{\varvec{y}}_{{\mathcal{U}}}\) is the vector of the means of the cluster totals, and \(\varvec{C}_{\mathcal{U}}\) is the variance-covariance matrix of the cluster totals.
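These quantities are straightforward to compute. The following minimal sketch (our own illustration: the array `y` of shape \(N\times m\), the list `clusters` of index arrays and the function name are assumptions, not notation from the paper) evaluates \(\varvec{C}\) and \(\varvec{C}_{\mathcal{U}}\):

```python
import numpy as np

def population_matrices(y, clusters):
    """Population variance-covariance matrix C (divisor N - 1) and
    variance-covariance matrix C_U of the G cluster totals (divisor G - 1).

    y        : (N, m) array whose rows are the observations y_k
    clusters : list of G index arrays forming a partition of {0, ..., N-1}
    """
    C = np.cov(y, rowvar=False, ddof=1)
    totals = np.array([y[idx].sum(axis=0) for idx in clusters])  # y_{U_h}
    C_U = np.cov(totals, rowvar=False, ddof=1)
    return C, C_U
```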

2.2 Simple Cluster Sampling

Let sample s be drawn from population U or from partition \({\mathcal{U}}\). A random sample will be denoted by the capital letter S, while its observed value by s.

A cluster sample s is defined as a g-element set of clusters \(U_h\) drawn from partition \({\mathcal{U}}\). The well-known simple cluster sampling design is defined as \(P_1(s)=\left( {\begin{array}{c}G\\ g\end{array}}\right) ^{-1}\) where \(s\in \varvec{S}_{\mathcal{U}}\) and \(\varvec{S}_{\mathcal{U}}\) is the sampling space generated for set \({\mathcal{U}}\). The unbiased vector estimator of the population total \(\varvec{y}_U\) is as follows:

$$\begin{aligned} \tilde{\varvec{y}}_S=\frac{G}{g}\sum _{h\in S}\sum _{k\in {U}_h}\varvec{y}_k=\frac{G}{g}\sum _{h\in S}\varvec{y}_{U_h}. \end{aligned}$$
(1)

Its variance-covariance matrix is:

$$\begin{aligned} \varvec{V}(\tilde{\varvec{y}}_S)=\frac{G(G-g)}{g}\varvec{C}_{\mathcal{U}} \end{aligned}$$
(2)

where \(\tilde{\varvec{y}}_S\) is evaluated on the basis of the simple cluster sample drawn without replacement. Generalizing the results of [9, pp. 129–133] to the multidimensional case, we derive in the “Appendix” the following expression (see also [11, 12]):

$$\begin{aligned} \varvec{V}(\tilde{\varvec{y}}_S)=\frac{G(G-g)}{g}\bar{N}\varvec{C}\left( \varvec{I}_m+\frac{N-G}{G-1}\varvec{\varDelta }\right) +\frac{G(G-g)}{g}\varvec{A} \end{aligned}$$
(3)

where:

$$\begin{aligned}&\varvec{\varDelta }=\varvec{I}_m-\varvec{C}^{-1}\varvec{C}_*, \nonumber \\&\varvec{A}=[a_{i,j}];\quad a_{i,j}=\frac{1}{G-1}\sum _{h=1}^G(N_h-\bar{N})N_h\bar{y}_{U_h,i}\bar{y}_{U_h,j},\quad i,j=1,...,m \end{aligned}$$
(4)

or \(\varvec{A}=\varvec{A}_1+\varvec{A}_2+\varvec{A}_3\),

$$\begin{aligned}&\varvec{A}_1=[a_{i,j}(111)],\quad \varvec{A}_2=[\bar{y}_{{\mathcal{U}},i}a_{i,j}(101)],\quad \varvec{A}_3=[\bar{y}_ja_{i,j}(110)], \nonumber \\&a_{i,j}(bed)=\frac{1}{G-1}\sum _{h=1}^G(N_h-\bar{N})^b(y_{U_h,i}-\bar{y}_{{\mathcal{U}},i})^e(\bar{y}_{U_h,j}-\bar{y}_j)^d, \nonumber \\&\varvec{C}_*=[c_{*i,j}],\quad c_{*i,j}=\frac{1}{N-G}\sum _{h=1}^G\sum _{k\in {U}_h}(y_{k,i}-\bar{y}_{U_h,i})(y_{k,j}-\bar{y}_{U_h,j}), \end{aligned}$$
(5)

or

$$\begin{aligned} c_{*i,j}=\sum _{h=1}^Gw_hc_{*,U_h,i,j},\quad c_{*,U_h,i,j}=\frac{1}{N_h-1}\sum _{k\in {U}_h}(y_{k,i}-\bar{y}_{U_h,i})(y_{k,j}-\bar{y}_{U_h,j}) \end{aligned}$$

and \(w_h=\frac{N_h-1}{N-G}\). Parameter \(\varvec{\varDelta }\) is the matrix of the coefficients of intra-cluster data spread homogeneity, or simply the homogeneity matrix. The intra-cluster variance-covariance matrix is denoted by \(\varvec{C}_*\). Let us underline that when \(N_h=M\) for all \(h=1,...,G\), then \(\varvec{A}=\varvec{O}\). Sarndal et al. [9] proved that all diagonal elements of \(\varvec{\varDelta }\) take values from \(\left[ -\frac{G-1}{N-G};1\right]\). Let \(\delta\) be an eigenvalue of \(\varvec{\varDelta }\). In the last part of the “Appendix”, the following inequality is proved:

$$\begin{aligned} -\frac{G-1}{N-G}\le \delta \le 1. \end{aligned}$$
(6)

Kish [5] provided sound advice on grouping problems that might be encountered in practical surveys.
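To make the role of \(\varvec{\varDelta }\) concrete, the following minimal sketch (continuing the assumed data layout from Sect. 2.1; the function name is ours) computes \(\varvec{C}_*\) and \(\varvec{\varDelta }\), whose eigenvalues can then be checked against the bound (6):

```python
import numpy as np

def homogeneity_matrix(y, clusters):
    """Pooled intra-cluster matrix C_* (divisor N - G) and the
    homogeneity matrix Delta = I_m - C^{-1} C_*."""
    N, m = y.shape
    G = len(clusters)
    C = np.cov(y, rowvar=False, ddof=1)
    C_star = np.zeros((m, m))
    for idx in clusters:
        d = y[idx] - y[idx].mean(axis=0)   # deviations from the cluster mean
        C_star += d.T @ d
    C_star /= N - G
    Delta = np.eye(m) - np.linalg.solve(C, C_star)   # I_m - C^{-1} C_*
    return C_star, Delta

# By (6), the eigenvalues of Delta lie in [-(G-1)/(N-G), 1]:
# delta = np.linalg.eigvals(Delta).real
```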

2.3 Relative Efficiency

Let \(\varvec{t}_{1s}\) and \(\varvec{t}_{2s}\) be unbiased estimators of the vector parameter \(\varvec{\theta }\in \varvec{\varTheta }\). Borovkov [2] proposed comparing the accuracy of vector estimators using the following definition (see also [7] or [12, pp. 28–29]):

Definition 1

Estimator \(\varvec{t}_{1s}\) is not worse than \(\varvec{t}_{2s}\) if and only if:

$$\begin{aligned} \forall _{\varvec{\alpha }\ne \varvec{0}}\forall _{\varvec{\theta }\in \varvec{\varTheta }}\;\;v(\varvec{t}_{1s}\varvec{\alpha }^T)\le v(\varvec{t}_{2s}\varvec{\alpha }^T) \end{aligned}$$

where \(\varvec{\alpha }=[\alpha _1...\alpha _m]\),

$$\begin{aligned} v(\varvec{t}_{is}\varvec{\alpha }^T)= \varvec{\alpha }\varvec{V}(\varvec{t}_{is})\varvec{\alpha }^T, \quad i=1,2. \end{aligned}$$

Estimator \(\varvec{t}_{1s}\) is better than \(\varvec{t}_{2s}\) if and only if \(\varvec{t}_{1s}\) is not worse than \(\varvec{t}_{2s}\) and the above inequality becomes sharp for at least one fixed parameter \(\varvec{\theta }\).

This definition directly leads to the following theorem (see [7] and Borovkov [2]):

Theorem 1

Let the variance-covariance matrices \(\varvec{V}(\varvec{t}_{is})\), \(i=1,2\), be positive definite. If estimator \(\varvec{t}_{1s}\) is not worse than \(\varvec{t}_{2s}\), then \(\varvec{V}(\varvec{t}_{2s})-\varvec{V}(\varvec{t}_{1s})\) is non-negative definite and:

$$\begin{aligned}&tr\left( \varvec{V}(\varvec{t}_{1s})\right) \le tr\left( \varvec{V}(\varvec{t}_{2s})\right) , \\&det\left( \varvec{V}(\varvec{t}_{1s})\right) \le det\left( \varvec{V}(\varvec{t}_{2s})\right) , \\&\uplambda \left( \varvec{V}(\varvec{t}_{1s})\right) \le \uplambda \left( \varvec{V}(\varvec{t}_{2s})\right) , \\&\forall _{j=1,...,m}\;\;v(t_{1,js})\le v(t_{2,js}) \end{aligned}$$

where \(tr\left( \varvec{V}(\varvec{t}_{is})\right)\), \(det\left( \varvec{V}(\varvec{t}_{is})\right)\) and \(\uplambda \left( \varvec{V}(\varvec{t}_{is})\right)\) are called the mean square radius, the generalized variance and the spectral radius (maximal eigenvalue of \(\varvec{V}(\varvec{t}_{is})\)) of the vector estimator \(\varvec{t}_{is}\), while \(v(t_{i,js})\) is the variance of the j-th component of \(\varvec{t}_{is}\). The above inequalities become sharp when \(\varvec{V}(\varvec{t}_{2s})-\varvec{V}(\varvec{t}_{1s})\) is positive definite.

The accuracy of estimator \(\tilde{\varvec{y}}_S\) is compared with the accuracy of the following well-known estimator of the vector of totals from an ordinary simple random sample drawn without replacement from a whole population:

$$\begin{aligned} \varvec{y}_S=\frac{N}{n}\sum _{k\in S}\varvec{y}_k,\qquad \varvec{V}(\varvec{y}_S)=\frac{N(N-n)}{n}\varvec{C} \end{aligned}$$
(7)

where S is drawn without replacement according to the sampling design \(P_0(s)=\left( {\begin{array}{c}N\\ n\end{array}}\right) ^{-1}\), \(s\in \varvec{S}\), and \(\varvec{S}\) is the sampling space generated for U. Under the assumption that \(n=g\bar{N}\), we have:

$$\begin{aligned} \varvec{V}(\tilde{\varvec{y}}_S)-\varvec{V}(\varvec{y}_S)=\frac{G(G-g)}{g}\left( \bar{N}\frac{N-G}{G-1}\varvec{C}\varvec{\varDelta }+\varvec{A}\right) = \frac{G(G-g)}{g}\left( \varvec{C}_{\mathcal{U}}-\bar{N}\varvec{C}\right) . \end{aligned}$$
(8)

According to Theorem 1, estimator \(\tilde{\varvec{y}}_S\) is not worse than \(\varvec{y}_S\) when \(\varvec{C}_{\mathcal{U}}-\bar{N}\varvec{C}\) is non-positive definite.

In particular, when \(N_h=M\) for all \(h=1,...,G\), expressions (4), (7) and (8) let us write:

$$\begin{aligned} \varvec{V}(\tilde{\varvec{y}}_S)-\varvec{V}(\varvec{y}_S)=\frac{N(N-n)}{n}\frac{N-G}{G-1}\varvec{C}\varvec{\varDelta }=\frac{N(N-n)}{n}\frac{N-G}{G-1}(\varvec{C}-\varvec{C}_*). \end{aligned}$$
(9)

If \(N_h=M\) for all \(h=1,...,G\), estimator \(\tilde{\varvec{y}}_S\) is not worse than \(\varvec{y}_S\) when \(\varvec{C}\varvec{\varDelta }\) is non-positive definite.
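The definiteness conditions above are easy to verify numerically. A minimal sketch (the function name and tolerance are ours): since \(\varvec{C}_{\mathcal{U}}-\bar{N}\varvec{C}\) is symmetric, it is non-positive definite exactly when its largest eigenvalue is not positive:

```python
import numpy as np

def cluster_not_worse(C, C_U, Nbar, tol=1e-12):
    """True when C_U - Nbar * C is non-positive definite, i.e. when,
    by (8) and Theorem 1, the cluster-sample estimator is not worse
    than the simple-random-sample estimator."""
    return np.linalg.eigvalsh(C_U - Nbar * C).max() <= tol
```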

The following theorem is proved in the “Appendix”.

Theorem 2

Let the variance-covariance matrices \(\varvec{V}(\varvec{t}_{is})\), \(i=1,2\) be positive definite. If estimator \(\varvec{t}_{1s}\) is not worse than \(\varvec{t}_{2s}\), then \(\varvec{V}(\varvec{t}_{2s})-\varvec{V}(\varvec{t}_{1s})\) is non-negative definite and:

$$\begin{aligned}&\uplambda \left( \varvec{V}(\varvec{t}_{2s})\varvec{V}^{-1}(\varvec{t}_{1s})\right) =\uplambda \left( \varvec{V}^{-1}(\varvec{t}_{1s})\varvec{V}(\varvec{t}_{2s})\right) \ge 1, \\&\uplambda \left( \varvec{V}^{-1}(\varvec{t}_{2s})\varvec{V}(\varvec{t}_{1s})\right) =\uplambda \left( \varvec{V}(\varvec{t}_{1s})\varvec{V}^{-1}(\varvec{t}_{2s})\right) \le 1, \\&\quad \uplambda _1\left( \varvec{V}(\varvec{t}_{2s})\varvec{V}^{-1}(\varvec{t}_{1s})\right) \le \frac{\varvec{\alpha }\varvec{V}(\varvec{t}_{2s})\varvec{\alpha }^T}{\varvec{\alpha }\varvec{V}(\varvec{t}_{1s})\varvec{\alpha }^T}=\frac{v(\varvec{t}_{2s}\varvec{\alpha }^T)}{v(\varvec{t}_{1s}\varvec{\alpha }^T)}\le \uplambda \left( \varvec{V}^{-1}(\varvec{t}_{1s})\varvec{V}(\varvec{t}_{2s})\right) \end{aligned}$$

for all \(\varvec{\alpha }\ne \varvec{0}\), where \(\uplambda _1(...)\) is the minimal eigenvalue of a matrix. The above inequalities become sharp when \(\varvec{V}(\varvec{t}_{2s})-\varvec{V}(\varvec{t}_{1s})\) is positive definite.

When the clusters are of the same size, Theorem 1 and expression (9) let us conclude that when matrix \(\varvec{\varDelta }\) is non-positive (non-negative) definite, then estimator \(\tilde{\varvec{y}}_S\) is not worse (not better) than \(\varvec{y}_S\).

Rao and Scott [8, p. 223] define the generalized relative efficiency coefficient as follows:

$$\begin{aligned} deff(\varvec{t}_S)=\uplambda \left( \varvec{V}(\varvec{y}_S)^{-1}\varvec{V}(\varvec{t}_S)\right) . \end{aligned}$$
(10)

where \(\varvec{V}(\varvec{y}_S)\) is non-singular. When \(n=g\bar{N}\), expressions (2), (3) and (10) lead to the following:

$$\begin{aligned} deff(\tilde{\varvec{y}}_S)=\frac{G(G-g)n}{N(N-n)g}\uplambda \left( \varvec{C}^{-1}\varvec{C}_{\mathcal{U}}\right) =1+\uplambda \left( \frac{N-G}{G-1}\varvec{\varDelta }+\frac{1}{\bar{N}}\varvec{C}^{-1}\varvec{A}\right) . \end{aligned}$$
(11)

Hence, \(deff(\tilde{\varvec{y}}_S)\) is minimal when the population is partitioned into a set \({\mathcal{U}}\) of clusters in such a way that \(\uplambda (\varvec{C}^{-1}\varvec{C}_{\mathcal{U}})\) is minimal.

In particular, expressions (3) and (4) show that when \(N_h=M\) for all \(h=1,...,G\), then \(\varvec{A}=\varvec{O}\). The inequality \(-\frac{G-1}{N-G}\le \uplambda (\varvec{\varDelta })\le 1\) leads to the following:

$$\begin{aligned} 0\le deff(\tilde{\varvec{y}}_S)=1+\frac{N-G}{G-1}\uplambda (\varvec{\varDelta })\le \frac{N-1}{G-1}. \end{aligned}$$
(12)

When \(\uplambda (\varvec{\varDelta })<0\), then \(\tilde{\varvec{y}}_S\) is more efficient than \(\varvec{y}_S\). Hence, we should partition the population into clusters of the same size in such a way that the coefficient \(\uplambda (\varvec{\varDelta })\) takes its minimal (negative) value.
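A minimal sketch of the computation of (10)–(11) under the assumed data layout (the function name is ours); for clusters of equal size M, the result reduces to \(1+\frac{N-G}{G-1}\uplambda (\varvec{\varDelta })\) as in (12):

```python
import numpy as np

def deff_cluster(C, C_U, N, G, n, g):
    """Generalized relative efficiency coefficient (10) of the
    cluster-sample estimator, written via lambda(C^{-1} C_U) as in (11)."""
    lam = np.linalg.eigvals(np.linalg.solve(C, C_U)).real.max()
    return G * (G - g) * n * lam / (N * (N - n) * g)
```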

2.4 Clustering Algorithms

We can expect that variables observed in a finite and fixed population on a past occasion are highly correlated with the corresponding variables observed on the current or future occasions. Therefore, census data could be used to construct a reasonable sampling design for a future occasion.

The above considerations lead to the conclusion that the population should be clustered in such a way that the maximal eigenvalue of \(\varvec{C}^{-1}\varvec{C}_{\mathcal{U}}\) takes its minimal value. Additionally, when we assume that the population has to be partitioned into clusters of the same size, the minimization of \(\uplambda (\varvec{\varDelta })\) is the criterion for population clustering. The following clustering algorithms will be considered:

Systematic algorithm 1:

Let us assume that \(\varvec{y}_k>\varvec{0}\) for all \(k=1,...,N\). Next, we evaluate the squared distances \(d_k=\varvec{y}_k\varvec{y}_k^T\) of \(\varvec{y}_k\) from the zero vector \(\varvec{0}\) for all \(k\in {U}\). Let us assume that the units are labelled so that \(d_k\le d_{k+1}\) for \(k=1,...,N-1\). The h-th cluster consists of the units with labels \(k\in {U}_h\) such that \(k=(i-1)G+h\), for \(i=1,...,M\) and \(h=1,...,G\). This leads to the inequalities \(d_{U_h}\le d_{U_{h+1}}\) for \(h=1,...,G-1\) where \(d_{U_h}=\sum _{k\in {U}_h}d_k\). The result of this clustering algorithm will be denoted by \({\mathcal{U}}_1\). In some sense, this result is the well-known systematic simple sample space.

Systematic algorithm 2: Let \(d_k=(\varvec{y}_k-\bar{\varvec{y}})(\varvec{y}_k-\bar{\varvec{y}})^T\) be the squared distance of \(\varvec{y}_k\) from vector \(\bar{\varvec{y}}\) for all \(k\in {U}\). Let us assume that the units are labelled so that \(d_k\le d_{k+1}\) for \(k=1,...,N-1\). Let \(M=2\) and \(N=MG\). In this case, \(U_h=\{h;N-h+1\}\) for \(h=1,...,G\). In general, when M is even and \(N=MG\), then \(U_h=\{(h-1)\frac{M}{2}+i;\;N-(h-1)\frac{M}{2}-i+1\}\) for \(h=1,...,G\) and \(i=1,...,M/2\). The result of this clustering algorithm will be denoted by \({\mathcal{U}}_2\).
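Both systematic algorithms reduce to sorting and slicing. A minimal sketch under the same assumed data layout (function names are ours):

```python
import numpy as np

def systematic_1(y, G):
    """Algorithm 1: sort the units by d_k = y_k y_k^T and assign the
    label k = (i - 1) G + h to cluster h."""
    order = np.argsort((y ** 2).sum(axis=1))
    return [order[h::G] for h in range(G)]       # G clusters of size M = N/G

def systematic_2(y, G):
    """Algorithm 2: sort the units by the squared distance from the
    mean vector and pair the smallest with the largest distances."""
    order = np.argsort(((y - y.mean(axis=0)) ** 2).sum(axis=1))
    N = len(order)
    M = N // G                                   # assumes M even and N = M G
    clusters = []
    for h in range(G):
        pos = np.arange(h * M // 2, (h + 1) * M // 2)
        clusters.append(np.concatenate([order[pos], order[N - 1 - pos]]))
    return clusters
```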

Permutation algorithm 3: Let \({\mathcal{U}}^{(0)}=\{U_1^{(0)},...,U_G^{(0)}\}\) be any starting partition of the population into clusters of the same size, \(M\ge m\). In the t-th (\(t=0,1,...\)) iteration, partition \({\mathcal{U}}^{(t)}=\{U_1^{(t)},...,U_G^{(t)}\}\) is generated by permuting the population elements at random. For an assumed \(t=T\), \({\mathcal{U}}^{(T)}\) is treated as optimal when

$$\begin{aligned} \uplambda _*({\mathcal{U}}^{(T)})= min_{\{t=1,...,T\}}(\uplambda (\varvec{\varDelta }({\mathcal{U}}^{(t)}))). \end{aligned}$$
(13)
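The resulting partition will be denoted by \({\mathcal{U}}_3\). A minimal sketch of this permutation search, reusing the `homogeneity_matrix` helper sketched in Sect. 2.2 (the function name and random generator are ours):

```python
import numpy as np

def permutation_algorithm_3(y, G, T, rng=np.random.default_rng(0)):
    """Algorithm 3: draw T random equal-size partitions and keep the
    one with the smallest maximal eigenvalue of Delta, cf. (13)."""
    best, best_lam = None, np.inf
    for _ in range(T):
        clusters = np.array_split(rng.permutation(len(y)), G)  # equal sizes when G | N
        _, Delta = homogeneity_matrix(y, clusters)
        lam = np.linalg.eigvals(Delta).real.max()
        if lam < best_lam:
            best, best_lam = clusters, lam
    return best, best_lam
```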

Iteration algorithm 4: Let \({\mathcal{U}}^{(0)}=\{U_1^{(0)},...,U_G^{(0)}\}\) be any starting partition of the population into clusters, which are not necessarily of the same size. Let \({\mathcal{U}}^{(t)}=\{U_1^{(t)},...,U_G^{(t)}\}\) be the partition of the population obtained as the result of the t-th iteration and let \(\uplambda _t=\uplambda (\varvec{C}^{-1}\varvec{C}_{{\mathcal{U}}^{(t)}})\) be the corresponding maximal eigenvalue. Moreover, let \(f_t: U\rightarrow \{1,...,G\}\), where \(f_t(k)=h\) if and only if \(k\in {U}_h^{(t)}\).

In iteration \(t+1\), we randomly choose a number \(k_*\) from the sequence 1, ..., N. Next, element \(k_*\) is moved from cluster \(h_\#=f_t(k_*)\) to cluster \(h_*\), where \(h_*\) is randomly drawn from the set \(\{h:h=1,...,G;\; h\ne h_\#\}\). This leads to the new partition \({\mathcal{U}}^{(t+1)}\). Finally, we compute \(\uplambda _{t+1}=\uplambda (\varvec{C}^{-1}\varvec{C}_{{\mathcal{U}}^{(t+1)}})\). If \(\uplambda _{t+1}<\uplambda _t\), then \({\mathcal{U}}^{(t+1)}\) becomes the current partition and we start iteration \(t+2\) of the algorithm. If \(\uplambda _{t+1}\ge \uplambda _t\), then we start iteration \(t+2\) of the algorithm from partition \({\mathcal{U}}^{(t)}\). The partitioning algorithm is stopped when the number of iterations reaches the assumed level T. This algorithm leads to the minimization of \(deff(\tilde{\varvec{y}}_S)\). The population clustered according to this algorithm will be denoted by \({\mathcal{U}}_4\).
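A minimal sketch of this random-reallocation search (the helper `max_eig` computes \(\uplambda (\varvec{C}^{-1}\varvec{C}_{\mathcal{U}})\); all names are ours):

```python
import numpy as np

def max_eig(C, clusters, y):
    """lambda(C^{-1} C_U) for the current partition."""
    totals = np.array([y[idx].sum(axis=0) for idx in clusters])
    C_U = np.cov(totals, rowvar=False, ddof=1)
    return np.linalg.eigvals(np.linalg.solve(C, C_U)).real.max()

def iteration_algorithm_4(y, clusters, T, rng=np.random.default_rng(0)):
    """Algorithm 4: move a random unit k_* to a random other cluster
    and accept the move only if lambda(C^{-1} C_U) decreases."""
    C = np.cov(y, rowvar=False, ddof=1)
    clusters = [list(c) for c in clusters]
    lam = max_eig(C, clusters, y)
    for _ in range(T):
        k = int(rng.integers(len(y)))                 # random unit k_*
        h = next(j for j, c in enumerate(clusters) if k in c)
        if len(clusters[h]) <= 1:
            continue                                  # keep clusters non-empty
        z = int(rng.choice([j for j in range(len(clusters)) if j != h]))
        trial = [list(c) for c in clusters]
        trial[h].remove(k)
        trial[z].append(k)
        lam_trial = max_eig(C, trial, y)
        if lam_trial < lam:                           # accept improvements only
            clusters, lam = trial, lam_trial
    return clusters, lam
```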

Iteration algorithm 5: The clustering procedure described below is similar to the above one and also leads to the minimization of \(deff(\tilde{\varvec{y}}_S)\).

Let \({\mathcal{U}}^{(t)}=\{U_1^{(t)},...,U_G^{(t)}\}\) be the partition of the population obtained as the result of the t-th iteration, where \(t=(l-1)N+k\), \(k=1,...,N\), \(l=1,2,...\), and let \(\uplambda _t=\uplambda (\varvec{C}^{-1}\varvec{C}_{{\mathcal{U}}^{(t)}})\) be the maximal eigenvalue evaluated on the basis of \({\mathcal{U}}^{(t)}\). Let \(f_t: U\rightarrow \{1,...,G\}\), where \(f_t(k)=h\) if and only if \(k\in {U}_h^{(t)}\).

In stage \(t+1\), the population element \(k\in {U}_h^{(t)}\), where \(h=f_t(k)\), is tentatively moved to each of the clusters \(U_z^{(t)}\), \(z\ne h\), \(z=1,...,G\), and the following is calculated:

$$\begin{aligned} (k,\underline{z})=arg\left( min_{\{z=1,...,G,z\ne f_{t}(k)\}}\left( \uplambda (\varvec{C}^{-1}\varvec{C}_{{\mathcal{U}}^{(t)}}(k,z))\right) \right) \end{aligned}$$
(14)

where \(\uplambda (\varvec{C}^{-1}\varvec{C}_{{\mathcal{U}}^{(t)}}(k,z))\) is evaluated for partition \({\mathcal{U}}^{(t)}\) in which clusters \(U_z^{(t)}\), \(U_h^{(t)}\) are replaced by \(\{U_z^{(t)}\cup \{k\}\}\) and \(\{U_h^{(t)}-\{k\}\}\), respectively, and \(h=f_{t}(k)\). If \(\uplambda (\varvec{C}^{-1}\varvec{C}_{{\mathcal{U}}^{(t)}}(k,\underline{z}))<\uplambda _{t}\), then \(\uplambda _{t+1}=\uplambda (\varvec{C}^{-1}\varvec{C}_{{\mathcal{U}}^{(t+1)}})\) and \({\mathcal{U}}^{(t+1)}\) is equal to \({\mathcal{U}}^{(t)}\) with clusters \(U_{\underline{z}}^{(t)}\) and \(U_{h}^{(t)}\) replaced by \(U_{\underline{z}}^{(t+1)}=\{U_{\underline{z}}^{(t)}\cup \{k\}\}\) and \(U_{h}^{(t+1)}=\{U_h^{(t)}-\{k\}\}\), respectively. If \(\uplambda (\varvec{C}^{-1}\varvec{C}_{{\mathcal{U}}^{(t)}}(k,\underline{z}))\ge \uplambda _{t}\), then \({\mathcal{U}}^{(t+1)}={\mathcal{U}}^{(t)}\) and \(\uplambda _{t+1}=\uplambda _{t}\).

The iterative clustering process is stopped when \(\uplambda _{t+N}=\uplambda _{t}\) or when the number of iterations reaches the assumed level T. The population clustered according to this algorithm will be denoted by \({\mathcal{U}}_5\).

Iteration algorithm 6: We keep the notation introduced earlier. In iteration \(t+1\), every population element \(k\in {U}_h^{(t)}\), where \(h=f_t(k)\), is tentatively moved to each of the clusters \(U_z^{(t)}\), \(z\ne h\), \(z=1,...,G\). Next, we calculate the following:

$$\begin{aligned} (\underline{k},\underline{z})=arg\left( min_{\{k\in {U}\}}min_{\{z\ne f_t(k),z=1,...,G\}}\left( \uplambda (\varvec{C}^{-1}\varvec{C}_{{\mathcal{U}}^{(t)}}(k,z))\right) \right) \end{aligned}$$
(15)

where \(\uplambda (\varvec{C}^{-1}\varvec{C}_{{\mathcal{U}}^{(t)}}(k,z))\) is evaluated for partition \({\mathcal{U}}^{(t)}\) in which clusters \(U_z^{(t)}\) and \(U_h^{(t)}\) are replaced by \(\{U_z^{(t)}\cup \{k\}\}\) and \(\{U_h^{(t)}-\{k\}\}\), respectively, and \(h=f_t(k)\). If \(\uplambda (\varvec{C}^{-1}\varvec{C}_{{\mathcal{U}}^{(t)}}(\underline{k},\underline{z}))<\uplambda _t\), then \(\uplambda (\varvec{C}^{-1}\varvec{C}_{{\mathcal{U}}^{(t+1)}})=\uplambda (\varvec{C}^{-1}\varvec{C}_{{\mathcal{U}}^{(t)}}(\underline{k},\underline{z}))\) and \({\mathcal{U}}^{(t+1)}\) is equal to \({\mathcal{U}}^{(t)}\) with clusters \(U_{\underline{z}}^{(t)}\) and \(U_{\underline{h}}^{(t)}\), where \(\underline{h}=f_t(\underline{k})\), replaced by \(U_{\underline{z}}^{(t+1)}=\{U_{\underline{z}}^{(t)}\cup \{\underline{k}\}\}\) and \(U_{\underline{h}}^{(t+1)}=\{U_{\underline{h}}^{(t)}-\{\underline{k}\}\}\), respectively. The iterative clustering process is stopped when \(\uplambda (\varvec{C}^{-1}\varvec{C}_{{\mathcal{U}}^{(t)}}(\underline{k},\underline{z}))\ge \uplambda _t\). The population clustered according to this algorithm will be denoted by \({\mathcal{U}}_6\).
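A minimal sketch of this greedy best-move search, reusing `max_eig` from the sketch for algorithm 4; algorithm 5 differs only in that it scans one fixed element per iteration instead of all of them:

```python
import numpy as np

def iteration_algorithm_6(y, clusters):
    """Algorithm 6: evaluate every move (k, z) of a unit k to another
    cluster z, apply the best one, and stop as soon as no move
    decreases lambda(C^{-1} C_U), cf. (15)."""
    C = np.cov(y, rowvar=False, ddof=1)
    clusters = [list(c) for c in clusters]
    G = len(clusters)
    lam = max_eig(C, clusters, y)
    while True:
        best = None
        for h in range(G):
            if len(clusters[h]) <= 1:
                continue                         # keep clusters non-empty
            for k in clusters[h]:
                for z in range(G):
                    if z == h:
                        continue
                    trial = [list(c) for c in clusters]
                    trial[h].remove(k)
                    trial[z].append(k)
                    lam_t = max_eig(C, trial, y)
                    if lam_t < lam:              # best improving move so far
                        best, lam = trial, lam_t
        if best is None:                         # no improving move: stop
            return clusters, lam
        clusters = best
```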

3 Accuracy Analysis

Data on Swedish municipalities published in the monograph [9] will be considered. Variables \(y_1\) and \(y_2\) are the real estate values (according to the 1984 assessment, in millions of kronor) and the number of municipal employees, respectively. Their population correlation coefficient is \(\rho _{y_1,y_2}=0.9924\). The population size (without outliers) is \(N=280\). Moreover, \(\bar{y}_{1,U}=51945.99\), \(\bar{y}_{2,U}=378859\), \(v_{y_1}=35954.39\), \(v_{y_2}=2008981\). The partitions obtained as results of the above clustering algorithms will be denoted by \({\mathcal{U}}_j\), \(j=1,...,6\). We will consider the sample sizes \(g=2,4,8,12,14,24\) and cluster sizes \(M=2,4,8,14\). The relative efficiency coefficient is evaluated according to expression (10) for the estimator \(\tilde{\varvec{y}}_S\).
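The computations behind Table 1 can be reproduced along the following lines (a sketch only: the file name `mu281.csv`, its column layout and the chosen \((M,g)\) pairs are hypothetical, and the helpers are those sketched in Sect. 2):

```python
import numpy as np

y = np.loadtxt("mu281.csv", delimiter=",")       # columns y_1, y_2; shape (N, 2)
N = len(y)
C = np.cov(y, rowvar=False, ddof=1)

for M, g in [(2, 14), (4, 12), (8, 8), (14, 2)]:
    G, n = N // M, g * M
    clusters = systematic_1(y, G)                # e.g. partition U_1
    totals = np.array([y[idx].sum(axis=0) for idx in clusters])
    C_U = np.cov(totals, rowvar=False, ddof=1)
    lam = np.linalg.eigvals(np.linalg.solve(C, C_U)).real.max()
    deff = G * (G - g) * n * lam / (N * (N - n) * g)
    print(M, g, round(deff, 3))
```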

Table 1 Relative efficiency for the population partitioned into clusters.

Analysis of Table 1 leads to the following conclusions. Only under clustering algorithms \({\mathcal{U}}_1\) and \({\mathcal{U}}_2\) is the accuracy of estimator \(\varvec{y}_S\) approximately not less than the accuracy of estimator \(\tilde{\varvec{y}}_S\) for all considered combinations \((M,g)\).

Partition \({\mathcal{U}}_4\) leads to the most efficient estimation based on \(\tilde{\varvec{y}}_S\). When we additionally assume that the population is split into sub-populations of the same size, estimator \(\tilde{\varvec{y}}_S\) based on the sample drawn from the population clustered according to algorithm \({\mathcal{U}}_3\) is the most efficient.

For algorithms \({\mathcal{U}}_1\) and \({\mathcal{U}}_2\), the estimation efficiency based on \(\tilde{\varvec{y}}_S\) decreases (or, equivalently, \(deff(\tilde{\varvec{y}}_S)\) increases) when the number of clusters g decreases under a fixed sample size n. For algorithms \({\mathcal{U}}_3\)–\({\mathcal{U}}_6\), the situation is reversed: under a fixed sample size n, the estimation efficiency based on \(\tilde{\varvec{y}}_S\) increases when the number of clusters g decreases. For instance, under partition \({\mathcal{U}}_4\), when \((M,g)=(2,14)\) and \((M,g)=(14,2)\), the accuracy of \(\tilde{\varvec{y}}_S\) is almost two times and fifty times better than the accuracy of \(\varvec{y}_S\), respectively.

4 Conclusions

In this paper, we have shown that it is possible to significantly increase the accuracy of estimating population totals using the vector estimator from a simple cluster sample drawn without replacement by considering a specific partition of a population into clusters. In the analysed empirical example, algorithms 5 and 6 lead to the optimal partition of the population. These algorithms should work quickly when the population size is large. The results could be useful for panel or census survey sampling repeated on more than one occasion. The results of the paper could be applied to partitioning a population into clusters based on census data. This could improve the accuracy of estimation of the vector of population totals. Moreover, the results could be useful in some aspects of big-data analysis.

This paper could be treated as a contribution to the comparison of vector estimators. Several properties of the generalized relative efficiency coefficient are considered in Theorem 2. The generalized coefficient of intra-cluster data spread homogeneity was defined, its properties were considered and its values were interpreted. The generalized deff coefficient was also written as a function of the matrix of coefficients of intra-cluster homogeneity. The proposed procedures could be developed in several ways. Other clustering algorithms could be considered. In particular, the clustering procedures based on multivariate variables that are proposed in this paper could be reduced to one-dimensional cases. For instance, these variables could be replaced with their principal component. In this case, the several clustering procedures based on one-dimensional variables that have been proposed by [14] could be adopted in our considerations.

In addition, many of the clustering algorithms available in the statistical literature (see, e.g. [1, 6]) divide the population into homogeneous clusters. Typically, these procedures can be modified into algorithms that ensure the maximum spread of multivariate observations within the clusters. This seems to be related to the well-known nearest (farthest) neighbour criteria. Properties of some sampling designs used in spatial statistics could inspire the construction of clustering algorithms. For example, the criteria considered by Thompson and Seber [10] or [13] could be adapted to divide a spatial population into clusters composed of non-neighbours.