1 Introduction

Research into practical survey sampling is usually based on vector parameters. The purpose of this paper is to estimate simultaneously the population totals of at least two variables. The well-known vector estimator from a simple cluster sample drawn without replacement is considered. Its accuracy is compared with that of the ordinary vector estimator from a simple random sample drawn without replacement by means of the variance-covariance matrix. We analyse the accuracy of vector estimators using the methodology proposed by Borovkov [2], Jensen [4] and Rao [7], together with the generalized relative efficiency coefficient proposed by Rao [8]. This coefficient is defined as the maximal eigenvalue of the product of two matrices. In our case, one of the matrices is the variance-covariance matrix of the vector estimator from the cluster sample and the other is the inverse of the variance-covariance matrix of the vector estimator from the simple random sample. Let us add that Wywial [14] analysed the accuracy of estimation procedures for the total based on samples selected according to several sampling designs from a population clustered using several algorithms. This paper is, in some sense, a generalization of those considerations to the simultaneous estimation of the totals of at least two variables under study.

The accuracy of the vector estimation of totals based on the cluster sample depends on the partitioning of the population into clusters. The influence of this partitioning on the accuracy of the estimation is considered. The problem is analysed by means of several methods of measuring the accuracy of vector estimation.

The obtained results should be useful in survey sampling conducted, e.g., by statistical offices. In this case, census data are usually available. Variables under study (observed during a census) can be used as auxiliary variables in survey sampling on a subsequent occasion. In this situation, the appropriate clustering of the population considered in this paper should contribute to improving future sampling strategies.

One aspect of big-data analysis is the problem of reducing the number of data (observations of variables). The results of this paper partially contribute to solving this problem because the considered methods provide partitions of a population into clusters each of which is as similar to the population as possible. More precisely, the spreads of the data observations in the clusters are not less than the spread of the data in the population.

The main results of this paper are as follows:

  • the variance-covariance matrix of the vector estimator of totals from the simple cluster sample is expressed as a function of the matrix of homogeneity coefficients, Sect. 2.2 and “Appendix”,

  • properties of the homogeneity matrix allow us to show when the vector estimator from the cluster sample is more accurate than the vector estimator from the simple random sample, Sect. 2.3 and “Appendix”,

  • several algorithms for partitioning a population into mutually disjoint and non-empty clusters are proposed, Sect. 2.4,

  • these algorithms are used to partition the population of Swedish municipalities into clusters, Sect. 3,

  • for these partitions, the values of the generalized coefficient of the relative efficiency of the vector estimator from the cluster sample are evaluated, which lets us analyse the influence of the partition of the population on the accuracy of estimation, Sect. 3.

2 Estimation Based on Cluster Sample

2.1 Basic Notations

Let U be a population of size N. The number of variables observed in U is denoted by m. Observations of a vector variable will be denoted by \(\varvec{y}_k=[y_{k,1}...y_{k,m}]\) where \( k \in U \). Let us assume that the population is partitioned into disjoint sub-populations \(U_h\) of sizes \(N_h\), \(h=1,...,G\), called clusters. Hence, \(N=\sum _{h=1}^{G}N_h\) and \(\bar{N}=G^{-1}\sum _{h=1}^{G}N_h\). Let \(\bar{N}=M\) if all the clusters are of the same size. Let \({\mathcal{U}}=\{U_1,...,U_h,...,U_G\}\) be a partition of the population elements into clusters. Hence, \({\mathcal{U}}\) is the set of G mutually disjoint and non-empty clusters. Let

$$\begin{aligned}&\bar{\varvec{y}}=[\bar{y}_1...\bar{y}_m]=\sum _{k\in {U}}\varvec{y}_k/N,\quad \varvec{y}_U=N\bar{\varvec{y}}=\sum _{k\in {U}}\varvec{y}_k=[y_{U,1}...y_{U,m}], \\&y_{U,i}=\sum _{k\in {U}}y_{k,i}, \quad \varvec{C}=[c_{i,j}],\quad c_{i,j}=\sum _{k\in {U}}(y_{k,i}-\bar{y}_i)(y_{k,j}-\bar{y}_j)/(N-1), \\&\varvec{R}=\varvec{D}^{-1/2}\varvec{C}\varvec{D}^{-1/2}=[r_{i,j}],\quad \varvec{D}=[v_i],\quad r_{i,j}=\frac{c_{i,j}}{\sqrt{v_iv_j}},\quad v_i=c_{i,i} \end{aligned}$$

where \(\bar{\varvec{y}}\) is the vector of population means, \(\varvec{y}_U\) is the vector of totals, \(\varvec{C}\) is the matrix of variances and covariances, \(\varvec{R}\) is the matrix of correlation coefficients and \(\varvec{D}\) is the diagonal matrix of variances. Moreover, let

$$\begin{aligned}&\bar{\varvec{y}}_{U_h}=\sum _{k\in {U}_h}\varvec{y}_k/N_h, \quad \bar{\varvec{y}}_{U_h}=[\bar{y}_{U_h,1}...\bar{y}_{U_h,m}],\quad \bar{y}_{U_h,i}=\sum _{k\in {U}_h}y_{k,i}/N_h, \\&\varvec{y}_{U_h}=N_h\bar{\varvec{y}}_{U_h}=\sum _{k\in {U}_h}\varvec{y}_k=[y_{U_h,1}...y_{U_h,m}],\quad y_{U_h,i}=\sum _{k\in {U}_h}y_{k,i}, \\&\varvec{C}_{U_h}=[c_{U_h,i,j}],\quad c_{U_h,i,j}=\sum _{k\in {U}_h}(y_{k,i}-\bar{y}_{U_h,i})(y_{k,j}-\bar{y}_{U_h,j})/(N_h-1), \\&\bar{\varvec{y}}_{{\mathcal{U}}}=\sum _{h=1}^G\varvec{y}_{U_h}/G=\varvec{y}_U/G=[\bar{y}_{{\mathcal{U}},1}...\bar{y}_{{\mathcal{U}},m}],\;\; \bar{y}_{{\mathcal{U}},i}=\sum _{h=1}^Gy_{U_h,i}/G=y_{U,i}/G, \\&\varvec{y}_{{\mathcal{U}}}=G\bar{\varvec{y}}_{{\mathcal{U}}}=\sum _{h=1}^G\varvec{y}_{U_h}=\varvec{y}_U,\quad \varvec{C}_{\mathcal{U}}=[c_{{\mathcal{U}},i,j}], \\&c_{{\mathcal{U}},i,j}=\sum _{h=1}^G({y}_{U_h,i}-\bar{y}_{{\mathcal{U}},i})({y}_{U_h,j}-\bar{y}_{{\mathcal{U}},j})/(G-1) \end{aligned}$$

where \(k\in {U}\), \(i=1,...,m\), \(j=1,...,m\), \(h=1,...,G\). Here, \(\bar{y}_{U_h,i}\) is the mean of the i-th variable in the h-th cluster, \(y_{U_h,i}\) is the total of the i-th variable in the h-th cluster, \(\varvec{C}_{U_h}\) is the variance-covariance matrix in the h-th cluster, \(\bar{\varvec{y}}_{{\mathcal{U}}}\) is the vector of the means of the cluster totals, and \(\varvec{C}_{\mathcal{U}}\) is the variance-covariance matrix of the cluster totals.
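These quantities are straightforward to compute. The following minimal sketch (our own illustration: the array `y` of shape \(N\times m\), the list `clusters` of index arrays and the function name are assumptions, not notation from the paper) evaluates \(\varvec{C}\) and \(\varvec{C}_{\mathcal{U}}\):

```python
import numpy as np

def population_matrices(y, clusters):
    """Population variance-covariance matrix C (divisor N - 1) and
    variance-covariance matrix C_U of the G cluster totals (divisor G - 1).

    y        : (N, m) array whose rows are the observations y_k
    clusters : list of G index arrays forming a partition of {0, ..., N-1}
    """
    C = np.cov(y, rowvar=False, ddof=1)
    totals = np.array([y[idx].sum(axis=0) for idx in clusters])  # y_{U_h}
    C_U = np.cov(totals, rowvar=False, ddof=1)
    return C, C_U
```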

2.2 Simple Cluster Sampling

Let sample s be drawn from population U or from partition \({\mathcal{U}}\). A random sample will be denoted by the capital letter S, while its observed value by s.

A cluster sample s is defined as a g-element set of clusters \(U_h\) drawn from partition \({\mathcal{U}}\). The well-known simple cluster sampling design is defined as \(P_1(s)=\left( {\begin{array}{c}G\\ g\end{array}}\right) ^{-1}\) where \(s\in \varvec{S}_{\mathcal{U}}\) and \(\varvec{S}_{\mathcal{U}}\) is the sampling space generated for set \({\mathcal{U}}\). The unbiased vector estimator of the population total \(\varvec{y}_U\) is as follows:

$$\begin{aligned} \tilde{\varvec{y}}_S=\frac{G}{g}\sum _{h\in S}\sum _{k\in {U}_h}\varvec{y}_k=\frac{G}{g}\sum _{h\in S}\varvec{y}_{U_h}. \end{aligned}$$
(1)

Its variance-covariance matrix is:

$$\begin{aligned} \varvec{V}(\tilde{\varvec{y}}_S)=\frac{G(G-g)}{g}\varvec{C}_{\mathcal{U}} \end{aligned}$$
(2)

where \(\tilde{\varvec{y}}_S\) is evaluated on the basis of the simple cluster sample drawn without replacement. Generalizing the results of [9, pp. 129–133] to the multidimensional case, we derive in the “Appendix” the following expression (see also [11, 12]):

$$\begin{aligned} \varvec{V}(\tilde{\varvec{y}}_S)=\frac{G(G-g)}{g}\bar{N}\varvec{C}\left( \varvec{I}_m+\frac{N-G}{G-1}\varvec{\varDelta }\right) +\frac{G(G-g)}{g}\varvec{A} \end{aligned}$$
(3)

where:

$$\begin{aligned}&\varvec{\varDelta }=\varvec{I}_m-\varvec{C}^{-1}\varvec{C}_*, \nonumber \\&\varvec{A}=[a_{i,j}];\quad a_{i,j}=\frac{1}{G-1}\sum _{h=1}^G(N_h-\bar{N})N_h\bar{y}_{U_h,i}\bar{y}_{U_h,j},\quad i,j=1,...,m \end{aligned}$$
(4)

or \(\varvec{A}=\varvec{A}_1+\varvec{A}_2+\varvec{A}_3\),

$$\begin{aligned}&\varvec{A}_1=[a_{i,j}(111)],\quad \varvec{A}_2=[\bar{y}_{{\mathcal{U}},i}a_{i,j}(101)],\quad \varvec{A}_3=[\bar{y}_ja_{i,j}(110)], \nonumber \\&a_{i,j}(bed)=\frac{1}{G-1}\sum _{h=1}^G(N_h-\bar{N})^b(y_{U_h,i}-\bar{y}_{{\mathcal{U}},i})^e(\bar{y}_{U_h,j}-\bar{y}_j)^d, \nonumber \\&\varvec{C}_*=[c_{*i,j}],\quad c_{*i,j}=\frac{1}{N-G}\sum _{h=1}^G\sum _{k\in {U}_h}(y_{k,i}-\bar{y}_{U_h,i})(y_{k,j}-\bar{y}_{U_h,j}), \end{aligned}$$
(5)

or

$$\begin{aligned} c_{*i,j}=\sum _{h=1}^Gw_hc_{*,U_h,i,j},\quad c_{*,U_h,i,j}=\frac{1}{N_h-1}\sum _{k\in {U}_h}(y_{k,i}-\bar{y}_{U_h,i})(y_{k,j}-\bar{y}_{U_h,j}) \end{aligned}$$

and \(w_h=\frac{N_h-1}{N-G}\). Parameter \(\varvec{\varDelta }\) is the matrix of the coefficients of intra-cluster data spread homogeneity, or simply the homogeneity matrix. The intra-cluster variance-covariance matrix is denoted by \(\varvec{C}_*\). Let us underline that when \(N_h=M\) for all \(h=1,...,G\), then \(\varvec{A}=\varvec{O}\). Sarndal et al. [9] proved that all diagonal elements of \(\varvec{\varDelta }\) take values from \(\left[ -\frac{G-1}{N-G};1\right]\). Let \(\delta\) be an eigenvalue of \(\varvec{\varDelta }\). In the last part of the “Appendix”, the following inequality is proved:

$$\begin{aligned} -\frac{G-1}{N-G}\le \delta \le 1. \end{aligned}$$
(6)

Kish [5] provided sound advice on grouping problems that might be encountered in practical surveys.
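To make the role of \(\varvec{\varDelta }\) concrete, the following minimal sketch (continuing the assumed data layout from Sect. 2.1; the function name is ours) computes \(\varvec{C}_*\) and \(\varvec{\varDelta }\), whose eigenvalues can then be checked against the bound (6):

```python
import numpy as np

def homogeneity_matrix(y, clusters):
    """Pooled intra-cluster matrix C_* (divisor N - G) and the
    homogeneity matrix Delta = I_m - C^{-1} C_*."""
    N, m = y.shape
    G = len(clusters)
    C = np.cov(y, rowvar=False, ddof=1)
    C_star = np.zeros((m, m))
    for idx in clusters:
        d = y[idx] - y[idx].mean(axis=0)   # deviations from the cluster mean
        C_star += d.T @ d
    C_star /= N - G
    Delta = np.eye(m) - np.linalg.solve(C, C_star)   # I_m - C^{-1} C_*
    return C_star, Delta

# By (6), the eigenvalues of Delta lie in [-(G-1)/(N-G), 1]:
# delta = np.linalg.eigvals(Delta).real
```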

2.3 Relative Efficiency

Let \(\varvec{t}_{1s}\) and \(\varvec{t}_{2s}\) be unbiased estimators of the vector parameter \(\varvec{\theta }\in \varvec{\varTheta }\). Borovkov [2] proposed comparing the accuracy of vector estimators using the following definition (see also [7] or [12, pp. 28–29]):

Definition 1

Estimator \(\varvec{t}_{1s}\) is not worse than \(\varvec{t}_{2s}\) if and only if:

$$\begin{aligned} \forall _{\varvec{\alpha }\ne \varvec{0}}\forall _{\varvec{\theta }\in \varvec{\varTheta }}\;\;v(\varvec{t}_{1s}\varvec{\alpha }^T)\le v(\varvec{t}_{2s}\varvec{\alpha }^T) \end{aligned}$$

where \(\varvec{\alpha }=[\alpha _1...\alpha _m]\),

$$\begin{aligned} v(\varvec{t}_{is}\varvec{\alpha }^T)= \varvec{\alpha }\varvec{V}(\varvec{t}_{is})\varvec{\alpha }^T, \quad i=1,2. \end{aligned}$$

Estimator \(\varvec{t}_{1s}\) is better than \(\varvec{t}_{2s}\) if and only if \(\varvec{t}_{1s}\) is not worse than \(\varvec{t}_{2s}\) and the above inequality becomes sharp for at least one fixed parameter \(\varvec{\theta }\).

This definition directly leads to the following theorem (see [7] and Borovkov [2]):

Theorem 1

Let the variance-covariance matrices \(\varvec{V}(\varvec{t}_{is})\), \(i=1,2\), be positive definite. If estimator \(\varvec{t}_{1s}\) is not worse than \(\varvec{t}_{2s}\), then \(\varvec{V}(\varvec{t}_{2s})-\varvec{V}(\varvec{t}_{1s})\) is non-negative definite and:

$$\begin{aligned}&tr\left( \varvec{V}(\varvec{t}_{1s})\right) \le tr\left( \varvec{V}(\varvec{t}_{2s})\right) , \\&det\left( \varvec{V}(\varvec{t}_{1s})\right) \le det\left( \varvec{V}(\varvec{t}_{2s})\right) , \\&\uplambda \left( \varvec{V}(\varvec{t}_{1s})\right) \le \uplambda \left( \varvec{V}(\varvec{t}_{2s})\right) , \\&\forall _{j=1,...,m}\;\;v(t_{1,js})\le v(t_{2,js}) \end{aligned}$$

where \(tr\left( \varvec{V}(\varvec{t}_{is})\right)\), \(det\left( \varvec{V}(\varvec{t}_{is})\right)\) and \(\uplambda \left( \varvec{V}(\varvec{t}_{is})\right)\) are called the mean square radius, the generalized variance and the spectral radius (maximal eigenvalue of \(\varvec{V}(\varvec{t}_{is})\)) of the vector estimator \(\varvec{t}_{is}\), while \(v(t_{i,js})\) is the variance of the j-th component of \(\varvec{t}_{is}\). The above inequalities become sharp when \(\varvec{V}(\varvec{t}_{2s})-\varvec{V}(\varvec{t}_{1s})\) is positive definite.

The accuracy of estimator \(\tilde{\varvec{y}}_S\) is compared with the accuracy of the following well-known estimator of the vector of totals from an ordinary simple random sample drawn without replacement from a whole population:

$$\begin{aligned} \varvec{y}_S=\frac{N}{n}\sum _{k\in S}\varvec{y}_k,\qquad \varvec{V}(\varvec{y}_S)=\frac{N(N-n)}{n}\varvec{C} \end{aligned}$$
(7)

where S is drawn without replacement according to the sampling design \(P_0(s)=\left( {\begin{array}{c}N\\ n\end{array}}\right) ^{-1}\), \(s\in \varvec{S}\), and \(\varvec{S}\) is the sampling space generated for U. Under the assumption that \(n=g\bar{N}\), we have:

$$\begin{aligned} \varvec{V}(\tilde{\varvec{y}}_S)-\varvec{V}(\varvec{y}_S)=\frac{G(G-g)}{g}\left( \bar{N}\frac{N-G}{G-1}\varvec{C}\varvec{\varDelta }+\varvec{A}\right) = \frac{G(G-g)}{g}\left( \varvec{C}_{\mathcal{U}}-\bar{N}\varvec{C}\right) . \end{aligned}$$
(8)

According to Theorem 1, estimator \(\tilde{\varvec{y}}_S\) is not worse than \(\varvec{y}_S\) when \(\varvec{C}_{\mathcal{U}}-\bar{N}\varvec{C}\) is non-positive definite.

In particular, when \(N_h=M\) for all \(h=1,...,G\), expressions (4), (7) and (8) let us write:

$$\begin{aligned} \varvec{V}(\tilde{\varvec{y}}_S)-\varvec{V}(\varvec{y}_S)=\frac{N(N-n)}{n}\frac{N-G}{G-1}\varvec{C}\varvec{\varDelta }=\frac{N(N-n)}{n}\frac{N-G}{G-1}(\varvec{C}-\varvec{C}_*). \end{aligned}$$
(9)

If \(N_h=M\) for all \(h=1,...,G\), estimator \(\tilde{\varvec{y}}_S\) is not worse than \(\varvec{y}_S\) when \(\varvec{C}\varvec{\varDelta }\) is non-positive definite.
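The definiteness conditions above are easy to verify numerically. A minimal sketch (the function name and tolerance are ours): since \(\varvec{C}_{\mathcal{U}}-\bar{N}\varvec{C}\) is symmetric, it is non-positive definite exactly when its largest eigenvalue is not positive:

```python
import numpy as np

def cluster_not_worse(C, C_U, Nbar, tol=1e-12):
    """True when C_U - Nbar * C is non-positive definite, i.e. when,
    by (8) and Theorem 1, the cluster-sample estimator is not worse
    than the simple-random-sample estimator."""
    return np.linalg.eigvalsh(C_U - Nbar * C).max() <= tol
```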

The following theorem is proved in the “Appendix”.

Theorem 2

Let the variance-covariance matrices \(\varvec{V}(\varvec{t}_{is})\), \(i=1,2\) be positive definite. If estimator \(\varvec{t}_{1s}\) is not worse than \(\varvec{t}_{2s}\), then \(\varvec{V}(\varvec{t}_{2s})-\varvec{V}(\varvec{t}_{1s})\) is non-negative definite and:

$$\begin{aligned}&\uplambda \left( \varvec{V}(\varvec{t}_{2s})\varvec{V}^{-1}(\varvec{t}_{1s})\right) =\uplambda \left( \varvec{V}^{-1}(\varvec{t}_{1s})\varvec{V}(\varvec{t}_{2s})\right) \ge 1, \\&\uplambda \left( \varvec{V}^{-1}(\varvec{t}_{2s})\varvec{V}(\varvec{t}_{1s})\right) =\uplambda \left( \varvec{V}(\varvec{t}_{1s})\varvec{V}^{-1}(\varvec{t}_{2s})\right) \le 1, \\&\quad \uplambda _1\left( \varvec{V}(\varvec{t}_{2s})\varvec{V}^{-1}(\varvec{t}_{1s})\right) \le \frac{\varvec{\alpha }\varvec{V}(\varvec{t}_{2s})\varvec{\alpha }^T}{\varvec{\alpha }\varvec{V}(\varvec{t}_{1s})\varvec{\alpha }^T}=\frac{v(\varvec{t}_{2s}\varvec{\alpha }^T)}{v(\varvec{t}_{1s}\varvec{\alpha }^T)}\le \uplambda \left( \varvec{V}^{-1}(\varvec{t}_{1s})\varvec{V}(\varvec{t}_{2s})\right) \end{aligned}$$

for all \(\varvec{\alpha }\ne \varvec{0}\), where \(\uplambda _1(...)\) is the minimal eigenvalue of a matrix. The above inequalities become sharp when \(\varvec{V}(\varvec{t}_{2s})-\varvec{V}(\varvec{t}_{1s})\) is positive definite.

When the clusters are of the same size, Theorem 1 and expression (9) let us conclude that when matrix \(\varvec{\varDelta }\) is non-positive (non-negative) definite, then estimator \(\tilde{\varvec{y}}_S\) is not worse (not better) than \(\varvec{y}_S\).

Rao and Scott [8, p. 223] define the generalized relative efficiency coefficient as follows:

$$\begin{aligned} deff(\varvec{t}_S)=\uplambda \left( \varvec{V}(\varvec{y}_S)^{-1}\varvec{V}(\varvec{t}_S)\right) . \end{aligned}$$
(10)

where \(\varvec{V}(\varvec{y}_S)\) is non-singular. When \(n=g\bar{N}\), expressions (2), (3) and (10) lead to the following:

$$\begin{aligned} deff(\tilde{\varvec{y}}_S)=\frac{G(G-g)n}{N(N-n)g}\uplambda \left( \varvec{C}^{-1}\varvec{C}_{\mathcal{U}}\right) =1+\uplambda \left( \frac{N-G}{G-1}\varvec{\varDelta }+\frac{1}{\bar{N}}\varvec{C}^{-1}\varvec{A}\right) . \end{aligned}$$
(11)

Hence, \(deff(\tilde{\varvec{y}}_S)\) is minimal when the population is partitioned into a set \({\mathcal{U}}\) of clusters in such a way that \(\uplambda (\varvec{C}^{-1}\varvec{C}_{\mathcal{U}})\) is minimal.

In particular, expressions (3) and (4) show that when \(N_h=M\) for all \(h=1,...,G\), then \(\varvec{A}=\varvec{O}\). The inequality \(-\frac{G-1}{N-G}\le \uplambda (\varvec{\varDelta })\le 1\) leads to the following:

$$\begin{aligned} 0\le deff(\tilde{\varvec{y}}_S)=1+\frac{N-G}{G-1}\uplambda (\varvec{\varDelta })\le \frac{N-1}{G-1}. \end{aligned}$$
(12)

When \(\uplambda (\varvec{\varDelta })<0\), then \(\tilde{\varvec{y}}_S\) is more efficient than \(\varvec{y}_S\). Hence, we should partition the population into clusters of the same size in such a way that the coefficient \(\uplambda (\varvec{\varDelta })\) takes its minimal (negative) value.
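A minimal sketch of the computation of (10)–(11) under the assumed data layout (the function name is ours); for clusters of equal size M, the result reduces to \(1+\frac{N-G}{G-1}\uplambda (\varvec{\varDelta })\) as in (12):

```python
import numpy as np

def deff_cluster(C, C_U, N, G, n, g):
    """Generalized relative efficiency coefficient (10) of the
    cluster-sample estimator, written via lambda(C^{-1} C_U) as in (11)."""
    lam = np.linalg.eigvals(np.linalg.solve(C, C_U)).real.max()
    return G * (G - g) * n * lam / (N * (N - n) * g)
```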

2.4 Clustering Algorithms

We can expect that variables observed in a finite and fixed population on a past occasion are highly correlated with the corresponding variables observed on the current or future occasions. Therefore, census data could be used to construct a reasonable sampling design for a future occasion.

The above considerations lead to the conclusion that the population should be clustered in such a way that the maximal eigenvalue of \(\varvec{C}^{-1}\varvec{C}_{\mathcal{U}}\) takes its minimal value. Additionally, when we assume that the population has to be partitioned into clusters of the same size, the minimization of \(\uplambda (\varvec{\varDelta })\) is the criterion for population clustering. The following clustering algorithms will be considered:

Systematic algorithm 1:

Let us assume that \(\varvec{y}_k>\varvec{0}\) for all \(k=1,...,N\). Next, we evaluate the squared distances \(d_k=\varvec{y}_k\varvec{y}_k^T\) of \(\varvec{y}_k\) from the zero vector \(\varvec{0}\) for all \(k\in {U}\). Let us assume that the units are labelled so that \(d_k\le d_{k+1}\) for \(k=1,...,N-1\). The h-th cluster consists of the units with labels \(k\in {U}_h\) such that \(k=(i-1)G+h\), for \(i=1,...,M\) and \(h=1,...,G\). This leads to the inequalities \(d_{U_h}\le d_{U_{h+1}}\) for \(h=1,...,G-1\) where \(d_{U_h}=\sum _{k\in {U}_h}d_k\). The result of this clustering algorithm will be denoted by \({\mathcal{U}}_1\). In some sense, this result is the well-known systematic simple sample space.

Systematic algorithm 2: Let \(d_k=(\varvec{y}_k-\bar{\varvec{y}})(\varvec{y}_k-\bar{\varvec{y}})^T\) be the squared distance of \(\varvec{y}_k\) from vector \(\bar{\varvec{y}}\) for all \(k\in {U}\). Let us assume that the units are labelled so that \(d_k\le d_{k+1}\) for \(k=1,...,N-1\). Let \(M=2\) and \(N=MG\). In this case, \(U_h=\{h;N-h+1\}\) for \(h=1,...,G\). In general, when M is even and \(N=MG\), then \(U_h=\{(h-1)\frac{M}{2}+i;\;N-(h-1)\frac{M}{2}-i+1\}\) for \(h=1,...,G\) and \(i=1,...,M/2\). The result of this clustering algorithm will be denoted by \({\mathcal{U}}_2\).
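Both systematic algorithms reduce to sorting and slicing. A minimal sketch under the same assumed data layout (function names are ours):

```python
import numpy as np

def systematic_1(y, G):
    """Algorithm 1: sort the units by d_k = y_k y_k^T and assign the
    label k = (i - 1) G + h to cluster h."""
    order = np.argsort((y ** 2).sum(axis=1))
    return [order[h::G] for h in range(G)]       # G clusters of size M = N/G

def systematic_2(y, G):
    """Algorithm 2: sort the units by the squared distance from the
    mean vector and pair the smallest with the largest distances."""
    order = np.argsort(((y - y.mean(axis=0)) ** 2).sum(axis=1))
    N = len(order)
    M = N // G                                   # assumes M even and N = M G
    clusters = []
    for h in range(G):
        pos = np.arange(h * M // 2, (h + 1) * M // 2)
        clusters.append(np.concatenate([order[pos], order[N - 1 - pos]]))
    return clusters
```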

Permutation algorithm 3: Let \({\mathcal{U}}^{(0)}=\{U_1^{(0)},...,U_G^{(0)}\}\) be any starting partition of the population into clusters of the same size, \(M\ge m\). In the t-th (\(t=0,1,...\)) iteration, partition \({\mathcal{U}}^{(t)}=\{U_1^{(t)},...,U_G^{(t)}\}\) is generated by permuting the population elements at random. For an assumed \(t=T\), \({\mathcal{U}}^{(T)}\) is treated as optimal when

$$\begin{aligned} \uplambda _*({\mathcal{U}}^{(T)})= min_{\{t=1,...,T\}}(\uplambda (\varvec{\varDelta }({\mathcal{U}}^{(t)}))). \end{aligned}$$
(13)
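The resulting partition will be denoted by \({\mathcal{U}}_3\). A minimal sketch of this permutation search, reusing the `homogeneity_matrix` helper sketched in Sect. 2.2 (the function name and random generator are ours):

```python
import numpy as np

def permutation_algorithm_3(y, G, T, rng=np.random.default_rng(0)):
    """Algorithm 3: draw T random equal-size partitions and keep the
    one with the smallest maximal eigenvalue of Delta, cf. (13)."""
    best, best_lam = None, np.inf
    for _ in range(T):
        clusters = np.array_split(rng.permutation(len(y)), G)  # equal sizes when G | N
        _, Delta = homogeneity_matrix(y, clusters)
        lam = np.linalg.eigvals(Delta).real.max()
        if lam < best_lam:
            best, best_lam = clusters, lam
    return best, best_lam
```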

Iteration algorithm 4: Let \({\mathcal{U}}^{(0)}=\{U_1^{(0)},...,U_G^{(0)}\}\) be any starting partition of the population into clusters, which are not necessarily of the same size. Let \({\mathcal{U}}^{(t)}=\{U_1^{(t)},...,U_G^{(t)}\}\) be the partition of the population obtained as the result of the t-th iteration and let \(\uplambda _t=\uplambda (\varvec{C}^{-1}\varvec{C}_{{\mathcal{U}}^{(t)}})\) be the corresponding maximal eigenvalue. Moreover, let \(f_t: U\rightarrow \{1,...,G\}\), where \(f_t(k)=h\) if and only if \(k\in {U}_h^{(t)}\).

In iteration \(t+1\), we randomly choose a number \(k_*\) from the sequence 1, ..., N. Next, element \(k_*\) is moved from cluster \(h_\#=f_t(k_*)\) to cluster \(h_*\), where \(h_*\) is randomly drawn from the set \(\{h:h=1,...,G;\; h\ne h_\#\}\). This leads to the new partition \({\mathcal{U}}^{(t+1)}\). Finally, we compute \(\uplambda _{t+1}=\uplambda (\varvec{C}^{-1}\varvec{C}_{{\mathcal{U}}^{(t+1)}})\). If \(\uplambda _{t+1}<\uplambda _t\), then \({\mathcal{U}}^{(t+1)}\) becomes the current partition and we start iteration \(t+2\) of the algorithm. If \(\uplambda _{t+1}\ge \uplambda _t\), then we start iteration \(t+2\) of the algorithm from partition \({\mathcal{U}}^{(t)}\). The partitioning algorithm is stopped when the number of iterations reaches the assumed level T. This algorithm leads to the minimization of \(deff(\tilde{\varvec{y}}_S)\). The population clustered according to this algorithm will be denoted by \({\mathcal{U}}_4\).
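A minimal sketch of this random-reallocation search (the helper `max_eig` computes \(\uplambda (\varvec{C}^{-1}\varvec{C}_{\mathcal{U}})\); all names are ours):

```python
import numpy as np

def max_eig(C, clusters, y):
    """lambda(C^{-1} C_U) for the current partition."""
    totals = np.array([y[idx].sum(axis=0) for idx in clusters])
    C_U = np.cov(totals, rowvar=False, ddof=1)
    return np.linalg.eigvals(np.linalg.solve(C, C_U)).real.max()

def iteration_algorithm_4(y, clusters, T, rng=np.random.default_rng(0)):
    """Algorithm 4: move a random unit k_* to a random other cluster
    and accept the move only if lambda(C^{-1} C_U) decreases."""
    C = np.cov(y, rowvar=False, ddof=1)
    clusters = [list(c) for c in clusters]
    lam = max_eig(C, clusters, y)
    for _ in range(T):
        k = int(rng.integers(len(y)))                 # random unit k_*
        h = next(j for j, c in enumerate(clusters) if k in c)
        if len(clusters[h]) <= 1:
            continue                                  # keep clusters non-empty
        z = int(rng.choice([j for j in range(len(clusters)) if j != h]))
        trial = [list(c) for c in clusters]
        trial[h].remove(k)
        trial[z].append(k)
        lam_trial = max_eig(C, trial, y)
        if lam_trial < lam:                           # accept improvements only
            clusters, lam = trial, lam_trial
    return clusters, lam
```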

Iteration algorithm 5: The clustering procedure described below is similar to the above one and also leads to the minimization of \(deff(\tilde{\varvec{y}}_S)\).

Let \({\mathcal{U}}^{(t)}=\{U_1^{(t)},...,U_G^{(t)}\}\) be the partition of the population obtained as the result of the t-th iteration, where \(t=(l-1)N+k\), \(k=1,...,N\), \(l=1,2,...\), and let \(\uplambda _t=\uplambda (\varvec{C}^{-1}\varvec{C}_{{\mathcal{U}}^{(t)}})\) be the maximal eigenvalue evaluated on the basis of \({\mathcal{U}}^{(t)}\). Let \(f_t: U\rightarrow \{1,...,G\}\), where \(f_t(k)=h\) if and only if \(k\in {U}_h^{(t)}\).

In stage \(t+1\), the population element \(k\in {U}_h^{(t)}\), where \(h=f_t(k)\), is tentatively moved to each of the clusters \(U_z^{(t)}\), \(z\ne h\), \(z=1,...,G\), and the following is calculated:

$$\begin{aligned} (k,\underline{z})=arg\left( min_{\{z=1,...,G,z\ne f_{t}(k)\}}\left( \uplambda (\varvec{C}^{-1}\varvec{C}_{{\mathcal{U}}^{(t)}}(k,z))\right) \right) \end{aligned}$$
(14)

where \(\uplambda (\varvec{C}^{-1}\varvec{C}_{{\mathcal{U}}^{(t)}}(k,z))\) is evaluated for partition \({\mathcal{U}}^{(t)}\) in which clusters \(U_z^{(t)}\), \(U_h^{(t)}\) are replaced by \(\{U_z^{(t)}\cup \{k\}\}\) and \(\{U_h^{(t)}-\{k\}\}\), respectively, and \(h=f_{t}(k)\). If \(\uplambda (\varvec{C}^{-1}\varvec{C}_{{\mathcal{U}}^{(t)}}(k,\underline{z}))<\uplambda _{t}\), then \(\uplambda _{t+1}=\uplambda (\varvec{C}^{-1}\varvec{C}_{{\mathcal{U}}^{(t+1)}})\) and \({\mathcal{U}}^{(t+1)}\) is equal to \({\mathcal{U}}^{(t)}\) with clusters \(U_{\underline{z}}^{(t)}\) and \(U_{h}^{(t)}\) replaced by \(U_{\underline{z}}^{(t+1)}=\{U_{\underline{z}}^{(t)}\cup \{k\}\}\) and \(U_{h}^{(t+1)}=\{U_h^{(t)}-\{k\}\}\), respectively. If \(\uplambda (\varvec{C}^{-1}\varvec{C}_{{\mathcal{U}}^{(t)}}(k,\underline{z}))\ge \uplambda _{t}\), then \({\mathcal{U}}^{(t+1)}={\mathcal{U}}^{(t)}\) and \(\uplambda _{t+1}=\uplambda _{t}\).

The iterative clustering process is stopped when \(\uplambda _{t+N}=\uplambda _{t}\) or when the number of iterations reaches the assumed level T. The population clustered according to this algorithm will be denoted by \({\mathcal{U}}_5\).

Iteration algorithm 6: We keep the notation introduced earlier. In iteration \(t+1\), every population element \(k\in {U}_h^{(t)}\), where \(h=f_t(k)\), is tentatively moved to each of the clusters \(U_z^{(t)}\), \(z\ne h\), \(z=1,...,G\). Next, we calculate the following:

$$\begin{aligned} (\underline{k},\underline{z})=arg\left( min_{\{k\in {U}\}}min_{\{z\ne f_t(k),z=1,...,G\}}\left( \uplambda (\varvec{C}^{-1}\varvec{C}_{{\mathcal{U}}^{(t)}}(k,z))\right) \right) \end{aligned}$$
(15)

where \(\uplambda (\varvec{C}^{-1}\varvec{C}_{{\mathcal{U}}^{(t)}}(k,z))\) is evaluated for partition \({\mathcal{U}}^{(t)}\) in which clusters \(U_z^{(t)}\) and \(U_h^{(t)}\) are replaced by \(\{U_z^{(t)}\cup \{k\}\}\) and \(\{U_h^{(t)}-\{k\}\}\), respectively, and \(h=f_t(k)\). If \(\uplambda (\varvec{C}^{-1}\varvec{C}_{{\mathcal{U}}^{(t)}}(\underline{k},\underline{z}))<\uplambda _t\), then \(\uplambda (\varvec{C}^{-1}\varvec{C}_{{\mathcal{U}}^{(t+1)}})=\uplambda (\varvec{C}^{-1}\varvec{C}_{{\mathcal{U}}^{(t)}}(\underline{k},\underline{z}))\) and \({\mathcal{U}}^{(t+1)}\) is equal to \({\mathcal{U}}^{(t)}\) with clusters \(U_{\underline{z}}^{(t)}\) and \(U_{\underline{h}}^{(t)}\), where \(\underline{h}=f_t(\underline{k})\), replaced by \(U_{\underline{z}}^{(t+1)}=\{U_{\underline{z}}^{(t)}\cup \{\underline{k}\}\}\) and \(U_{\underline{h}}^{(t+1)}=\{U_{\underline{h}}^{(t)}-\{\underline{k}\}\}\), respectively. The iterative clustering process is stopped when \(\uplambda (\varvec{C}^{-1}\varvec{C}_{{\mathcal{U}}^{(t)}}(\underline{k},\underline{z}))\ge \uplambda _t\). The population clustered according to this algorithm will be denoted by \({\mathcal{U}}_6\).
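A minimal sketch of this greedy best-move search, reusing `max_eig` from the sketch for algorithm 4; algorithm 5 differs only in that it scans one fixed element per iteration instead of all of them:

```python
import numpy as np

def iteration_algorithm_6(y, clusters):
    """Algorithm 6: evaluate every move (k, z) of a unit k to another
    cluster z, apply the best one, and stop as soon as no move
    decreases lambda(C^{-1} C_U), cf. (15)."""
    C = np.cov(y, rowvar=False, ddof=1)
    clusters = [list(c) for c in clusters]
    G = len(clusters)
    lam = max_eig(C, clusters, y)
    while True:
        best = None
        for h in range(G):
            if len(clusters[h]) <= 1:
                continue                         # keep clusters non-empty
            for k in clusters[h]:
                for z in range(G):
                    if z == h:
                        continue
                    trial = [list(c) for c in clusters]
                    trial[h].remove(k)
                    trial[z].append(k)
                    lam_t = max_eig(C, trial, y)
                    if lam_t < lam:              # best improving move so far
                        best, lam = trial, lam_t
        if best is None:                         # no improving move: stop
            return clusters, lam
        clusters = best
```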

3 Accuracy Analysis

Data on Swedish municipalities published in the monograph [9] will be considered. Variables \(y_1\) and \(y_2\) are the real estate values (according to the 1984 assessment, in millions of kronor) and the number of municipal employees, respectively. Their population correlation coefficient is \(\rho _{y_1,y_2}=0.9924\). The population size (without outliers) is \(N=280\). Moreover, \(\bar{y}_{1,U}=51945.99\), \(\bar{y}_{2,U}=378859\), \(v_{y_1}=35954.39\), \(v_{y_2}=2008981\). The partitions obtained as results of the above clustering algorithms will be denoted by \({\mathcal{U}}_j\), \(j=1,...,6\). We will consider the sample sizes \(g=2,4,8,12,14,24\) and cluster sizes \(M=2,4,8,14\). The relative efficiency coefficient is evaluated according to expression (10) for the estimator \(\tilde{\varvec{y}}_S\).
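The computations behind Table 1 can be reproduced along the following lines (a sketch only: the file name `mu281.csv`, its column layout and the chosen \((M,g)\) pairs are hypothetical, and the helpers are those sketched in Sect. 2):

```python
import numpy as np

y = np.loadtxt("mu281.csv", delimiter=",")       # columns y_1, y_2; shape (N, 2)
N = len(y)
C = np.cov(y, rowvar=False, ddof=1)

for M, g in [(2, 14), (4, 12), (8, 8), (14, 2)]:
    G, n = N // M, g * M
    clusters = systematic_1(y, G)                # e.g. partition U_1
    totals = np.array([y[idx].sum(axis=0) for idx in clusters])
    C_U = np.cov(totals, rowvar=False, ddof=1)
    lam = np.linalg.eigvals(np.linalg.solve(C, C_U)).real.max()
    deff = G * (G - g) * n * lam / (N * (N - n) * g)
    print(M, g, round(deff, 3))
```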

Table 1 Relative efficiency for the population partitioned into clusters.

Analysis of Table 1 leads to the following conclusions. Only under clustering algorithms \({\mathcal{U}}_1\) and \({\mathcal{U}}_2\) is the accuracy of estimator \(\varvec{y}_S\) approximately not less than the accuracy of estimator \(\tilde{\varvec{y}}_S\) for all considered combinations \((M,g)\).

Partition \({\mathcal{U}}_4\) leads to the most efficient estimation based on \(\tilde{\varvec{y}}_S\). When we additionally assume that the population is split into sub-populations of the same size, estimator \(\tilde{\varvec{y}}_S\) based on the sample drawn from the population clustered according to algorithm \({\mathcal{U}}_3\) is the most efficient.

For algorithms \({\mathcal{U}}_1\) and \({\mathcal{U}}_2\), the estimation efficiency based on \(\tilde{\varvec{y}}_S\) decreases (or, equivalently, \(deff(\tilde{\varvec{y}}_S)\) increases) when the number of clusters g decreases under a fixed sample size n. For algorithms \({\mathcal{U}}_3\)–\({\mathcal{U}}_6\), the situation is reversed: under a fixed sample size n, the estimation efficiency based on \(\tilde{\varvec{y}}_S\) increases when the number of clusters g decreases. For instance, under partition \({\mathcal{U}}_4\), when \((M,g)=(2,14)\) and \((M,g)=(14,2)\), the accuracy of \(\tilde{\varvec{y}}_S\) is almost two times and fifty times better than the accuracy of \(\varvec{y}_S\), respectively.

4 Conclusions

In this paper, we have shown that it is possible to significantly increase the accuracy of estimating population totals using the vector estimator from a simple cluster sample drawn without replacement by considering a specific partition of a population into clusters. In the analysed empirical example, algorithms 5 and 6 lead to the optimal partition of the population. These algorithms should work quickly when the population size is large. The results could be useful for panel or census survey sampling repeated on more than one occasion. The results of the paper could be applied to partitioning a population into clusters based on census data. This could improve the accuracy of estimation of the vector of population totals. Moreover, the results could be useful in some aspects of big-data analysis.

This paper could be treated as a contribution to the comparison of vector estimators. Several properties of the generalized relative efficiency coefficient are considered in Theorem 2. The generalized coefficient of intra-cluster data spread homogeneity was defined, its properties were considered and its values were interpreted. The generalized deff coefficient was also written as a function of the matrix of coefficients of intra-cluster homogeneity. The proposed procedures could be developed in several ways. Other clustering algorithms could be considered. In particular, the clustering procedures based on multivariate variables that are proposed in this paper could be reduced to one-dimensional cases. For instance, these variables could be replaced with their principal component. In this case, the several clustering procedures based on one-dimensional variables that have been proposed by [14] could be adopted in our considerations.

In addition, many of the clustering algorithms available in the statistical literature (see, e.g. [1, 6]) divide the population into homogeneous clusters. Typically, these procedures can be modified into algorithms that ensure the maximum spread of multivariate observations within the clusters. This seems to be related to the well-known nearest (farthest) neighbour criteria. Properties of some sampling designs used in spatial statistics could inspire the construction of clustering algorithms. For example, the criteria considered by Thompson and Seber [10] or [13] could be adapted to divide a spatial population into clusters composed of non-neighbours.