# Neyman-type sample allocation for domains-efficient estimation in multistage sampling


## Abstract

We consider a problem of allocation of a sample in two- and three-stage sampling. We seek an allocation which is both multi-domain and population efficient. Choudhry et al. (Survey Methods 38(1):23–29, 2012) recently considered such a problem for one-stage stratified simple random sampling without replacement in domains. Their approach was through minimization of the sample size under constraints on relative variances in all domains and on the overall relative variance. To attain this goal, they used nonlinear programming. Alternatively, we minimize here the relative variances in all domains (controlling them through given priority weights) as well as the overall relative variance under constraints imposed on total (expected) cost. We consider several two- and three-stage sampling schemes. Our aim is to shed some light on the analytic structure of solutions rather than to derive a purely numerical tool for sample allocation. To this end, we develop the eigenproblem methodology introduced in optimal allocation problems in Niemiro and Wesołowski (Appl Math 28:73–82, 2001) and recently updated in Wesołowski and Wieczorkowski (Commun Stat Theory Methods 46(5):2212–2231, 2017) by taking into account several new sampling schemes and, more importantly, by the (single) total expected variable cost constraint. Such an approach allows for solutions which are direct generalizations of the Neyman-type allocation. The structure of the solution is deciphered from the explicit allocation formulas given in terms of an eigenvector \({\underline{v}}^*\) of a population-based matrix \(\mathbf{D}\). The solution we provide can be viewed as a multi-domain version of the Neyman-type allocation in multistage stratified SRSWOR schemes.

## 1 Introduction

Consider a stratified SRSWOR in a population *U* of size *N* with strata \(U_1,\ldots ,U_I\), which form a partition of *U*, and let \(N_h\) denote the size of the stratum \(U_h\). For a variable \({\mathcal {Y}}\) in *U*, we denote \(y_k={\mathcal {Y}}(k)\), \(k\in U\). The standard estimator of the total \(\tau =\sum _{k\in U}\,y_k\) has the form \({\hat{\tau }}_{st}=\sum _{h=1}^I\,N_h{\bar{y}}_h\), where \({\bar{y}}_h=\tfrac{1}{n_h}\sum _{k\in {\mathcal {S}}_h}\,y_k\) with \(n_h\) denoting the size of the sample \({\mathcal {S}}_h\) drawn from the stratum \(U_h\), \(h=1,\ldots ,I\). The variance of this estimator is \(D^2({\hat{\tau }}_{st})=\sum _{h=1}^I\,\left( \tfrac{1}{n_h}-\tfrac{1}{N_h}\right) \,N_h^2S_h^2\), where \(S_h^2=\tfrac{1}{N_h-1}\sum _{k\in U_h}\,(y_k-{\bar{y}}_h)^2\) is the population variance in the *h*th stratum.

The basic question for such a setting is the optimal allocation, \({\underline{n}}=(n_1,\ldots ,n_I)\), of the sample among the strata. To this end, one may fix a given (relative) variance of the estimator \({\hat{\tau }}_{st}\) and minimize the costs expressed, for example, by the total sample size \(\sum _{h=1}^I\,n_h\). A related approach is to fix a total sample size \(n=n_1+\cdots +n_I\) and minimize the (relative) variance. Both cases are examples of the classical Neyman optimal allocation procedure which, for example, in the second case results in the allocation \(n_h=n\tfrac{N_hS_h}{\sum _{g=1}^I\,N_gS_g}\), \(h=1,\ldots ,I\). In both settings, the result is a simple consequence of minimization using the Lagrange function or can be derived from the Cauchy–Schwarz inequality.
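As a small illustration of the fixed-sample-size case, the following sketch evaluates the Neyman formula \(n_h=n\,N_hS_h/\sum _g N_gS_g\) and compares the resulting variance of \({\hat{\tau }}_{st}\) with that of proportional allocation. The stratum sizes and standard deviations are hypothetical numbers chosen only for the example, and the function names are not taken from any package.

```python
def neyman_allocation(n, N, S):
    """Split the total sample size n: n_h = n * N_h * S_h / sum_g N_g * S_g."""
    weights = [N_h * S_h for N_h, S_h in zip(N, S)]
    total = sum(weights)
    return [n * w / total for w in weights]


def variance_of_total(n_alloc, N, S):
    """D^2 = sum_h (1/n_h - 1/N_h) * N_h^2 * S_h^2 for stratified SRSWOR."""
    return sum((1.0 / n_h - 1.0 / N_h) * N_h ** 2 * S_h ** 2
               for n_h, N_h, S_h in zip(n_alloc, N, S))


# Hypothetical population: stratum sizes and stratum standard deviations.
N = [1000, 500, 200]
S = [2.0, 4.0, 10.0]
n = 120

neyman = neyman_allocation(n, N, S)
proportional = [n * N_h / sum(N) for N_h in N]

var_neyman = variance_of_total(neyman, N, S)
var_proportional = variance_of_total(proportional, N, S)
```

In this example all products \(N_hS_h\) are equal, so the Neyman allocation splits the sample evenly, and its variance is smaller than the one obtained under proportional allocation. The formula does not enforce \(n_h\le N_h\) or integer values.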

Recently, there has been growing interest in more refined allocation methods (also in two-stage sampling) based on nonlinear programming ensuring efficient estimation procedures for the whole population; see, for example, Clark and Steel (2000), Lednicki and Wieczorkowski (2003), Clark (2009), Khan et al. (2010), Münnich et al. (2012), Gabler et al. (2012), Ballin and Barcaroli (2013), Valliant et al. (2013, 2015). Much less is known about allocation procedures which are domains efficient or both population and domains efficient; see, for example, Costa et al. (2004), Longford (2006), Choudhry et al. (2012) (referred to as CRH in the sequel), Molefe and Clark (2015) and Keto and Pahkinen (2017). All of them are again based on nonlinear programming and are designed for single-stage sampling schemes. To the best of our knowledge, the only examples of domains-efficient allocation procedures in two-stage sampling schemes are those related to the eigenproblem approach. Such an approach will be explained and discussed in the sequel.

*h* denoting a stratum into *i* denoting a domain), that is, we would like to control not only the overall (relative) variance but also (relative) variances in each of the domains. In the context of both multi- and small-area estimation, Longford (2006) suggested minimizing (under a constraint given by the total sample size) the objective function

*G* is a weight responsible for a priority for the variance of the population mean estimator. In the context of model-assisted methodology, this approach has been recently developed in Molefe and Clark (2015). Mathematically, the problem reduces to the Neyman allocation scheme. Similarly, when a given value is assigned for (1), the total sample size is minimized. The weights \((P_i,\,i=1,\ldots ,I)\) are designed in order to cover, at least to some extent, jointly the optimality issue for domains and for the whole population. As pointed out in Friedrich and Münnich (2018), the approach of Gabler et al. (2012) can also be used in this context (actually, they mention the case with \(GP_+=0\)). Since the objective function (1) is a weighted sum of domain and population variances, this approach does not give any convenient tool to control the quality of the estimators of the population and domain means. Moreover, it is not clear how to assess the impact of the values of the weights \(P_i\), \(i=1,\ldots ,I\), and \(GP_+\) on the variances \(D^2({\bar{y}}_i)\), \(i=1,\ldots ,I\), and \(D^2({\bar{y}}_{st})\). These issues are clearly visible in the numerical example given in the Appendix, where such an approach is compared with the one we propose in this paper.

The NLP solution, such as the one described above, is an efficient tool for applications. Such purely numerical approaches to allocation problems are popular in real surveys. A drawback of such methods is that they give only numerical values and do not provide any information on the structure of the solution, which, for example, can be important for designing priorities for the domains.

Now we describe an alternative approach to the problem of domains-overall-efficient allocation in the sampling scheme considered in CRH. The approach makes the analytic form of the solution visible. The respective expression is based on a unique direction in the space \({{\mathbb {R}}}^I\), where the dimension *I* is equal to the number of domains. The rest of this section is just a warm-up illustration of the eigenproblem methodology, which we apply in full in several multistage schemes in the main part of the paper.

*T* is an unknown positive constant. Such an approach allows one to fully control the variability across domains of the (relative) variances of the estimators (see the numerical example in the Appendix). Moreover, under the above constraint, the unknown parameter *T* controls not only relative variances in domains but also the overall relative variance \({S}\) of the estimator of the population mean. It follows from the fact that under (4), due to (3), the relative overall variance *S* can be written as

### Remark 1.1

*T*. It is obvious that there exists a unique positive solution \(T=T^*\), which has to be derived numerically. Then, the allocation is given by (7) with \(T=T^*\).

As we mentioned above, there are alternatives to the eigenproblem approach to the (domains-population)-efficient allocation issue in the case of SRSWOR in domains. Apart from the possibility mentioned in Remark 1.1, the same allocation can be obtained (up to box constraints) by the CRH methodology if \(T_i:=\kappa _i T^*\) (with the value of \(T^*\) as computed in the eigenproblem procedure) and *g* is minimized through the NLP procedure. Similarly, each of these three approaches (CRH, LW and eigenproblem) can be applied in the case of stratified SRSWOR in each of the domains. It suffices to start the procedure with the Neyman allocation in every domain.

However, the situation changes drastically when two-stage (or multistage) sampling is taken into account. Then, as will be explained in the following sections, even in the simplest case of two-stage sampling with SRSWOR at both stages (and no stratification), the formula relating the sizes of samples at the first and the second stage with variances, an analog of the one which led to (7), does not yield a simple equation for the unknown *T*, such as (8) in Remark 1.1. Therefore, such a direct numerical approach is not possible. To the best of our knowledge, no analogs of the NLP procedure from CRH are available in the literature in the multistage setting. Nevertheless, nonlinearly constrained optimization solvers, for example MINOS, MOSEK or IPOPT, available on the Web through the NEOS server, can be used as potential tools for NLP answers to the two-stage extension of the original CRH problem.

It appears that in this situation, as well as in more complicated ones, the optimal allocation issue can be conveniently handled through the eigenproblem methodology, which provides insight into the structure of the optimal solutions, though in some non-typical cases it may give only approximately optimal results. It suffers from the same drawbacks as the original Neyman optimal allocation; i.e., the natural box constraints can be violated and the solution typically is not integer valued. The main aim of the present paper is to show how such an eigenproblem approach works in several new settings involving multistage sampling. In Sect. 2, we consider two-stage sampling with stratified SRSWOR at both stages. Special simplified cases are described in Sect. 3. Then, in Sect. 4, we deal with the situation in which at one of the stages \(\mathrm {pps}\) sampling with replacement is used while at the other the sample is drawn according to SRSWOR. Finally, in Sect. 5 we analyze three-stage sampling with SRSWOR at every stage. In all these cases, the allocation problem with the total cost constraint is solved via an eigenproblem for rank-one perturbations of diagonal matrices. The case of \(\mathrm {pps}\) sampling with replacement at the first stage and SRSWOR at the second stage is rather special: then, the eigenproblem is for a matrix of rank one and thus an analytic form of the eigenvector responsible for allocation is available.
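To see why a rank-one matrix admits a closed-form eigenvector, the following sketch checks numerically that a matrix of the form \({\underline{a}}\,{\underline{a}}^T/C\) has the eigenvector \({\underline{a}}\) (or any multiple of it) with eigenvalue \(\Vert {\underline{a}}\Vert ^2/C\). The form \({\underline{a}}\,{\underline{a}}^T/C\), the vector \({\underline{a}}\) and the constant \(C\) are assumptions made only for this illustration; the actual population matrix is the one defined in Sect. 4.1.

```python
import math


def rank_one_matrix(a, C):
    """Return the matrix a a^T / C as a list of rows (hypothetical form)."""
    return [[x * y / C for y in a] for x in a]


def mat_vec(D, v):
    """Plain matrix-vector product."""
    return [sum(d * x for d, x in zip(row, v)) for row in D]


a = [3.0, 4.0, 12.0]   # hypothetical vector
C = 50.0               # hypothetical positive constant

D = rank_one_matrix(a, C)
v = [math.sqrt(C) * x for x in a]          # a rescaled copy of a
eigenvalue = sum(x * x for x in a) / C     # ||a||^2 / C

Dv = mat_vec(D, v)
residual = max(abs(lhs - eigenvalue * x) for lhs, x in zip(Dv, v))
trace = sum(D[r][r] for r in range(len(a)))
```

The residual of \(\mathbf{D}{\underline{v}}-\lambda {\underline{v}}\) vanishes up to rounding, and the trace equals the eigenvalue, as it must when all other eigenvalues are zero.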

The eigenproblem approach to efficient allocation in domains was originally proposed in Niemiro and Wesołowski (2001) (NW in the sequel) and recently developed in Wesołowski and Wieczorkowski (2017) (WW in the sequel). The major difference between the setting of these two papers and our setting is the form of the cost constraints: Here, we consider a single total cost constraint, while two constraints, one on the size of the PSUs sample and one on the expected size of the SSUs sample, were imposed jointly in these earlier papers. There are important consequences of such a change in the cost constraints. Due to the form of the cost constraint, our solution is a direct generalization of the Neyman-type allocation. In particular, it gives the Neyman-type solution in the case when there are no domains (i.e., the whole population is a single domain). At the technical level, the population matrix \(\mathbf{D}\), on which everything is based, is a rank-one perturbation of a diagonal matrix, while it was a rank-two perturbation of a diagonal matrix in NW and WW. There is also an important difference from NW and WW with respect to the structure of the allocation. The common feature is that there is an eigenvector \({\underline{v}}^*\) of the matrix \(\mathbf{D}\) which plays an important role in the optimal allocation; however, in the case we consider here, it influences only the optimal allocation at the first stage, while in the cases considered in NW and WW the optimal allocation at both stages depends explicitly on the respective version of \({\underline{v}}^*\).

## 2 Two-stage sampling with stratified SRSWOR at both stages

For any \(i=1,\ldots ,I\), the subpopulation \({\mathcal {V}}_i\) of primary sampling units (PSUs) of the *i*th domain in *U* is stratified: \({\mathcal {V}}_i=\bigcup _{h=1}^{H_i}\,{\mathcal {V}}_{i,h}\). Let \(M_{i,h}\) denote the number of PSUs in \({\mathcal {V}}_{i,h}\). Every PSU, understood as a collection of secondary sampling units (SSUs), is also stratified: A PSU *j* from the stratum \({\mathcal {V}}_{i,h}\) is stratified into \(\bigcup _{g=1}^{G_{i,h,j}}\,{\mathcal {W}}_{i,h,j,g}\).

*U* is the population of students, \({\mathcal {V}}_i\) is the subpopulation of schools in the *i*th region, while \({\mathcal {V}}_{i,h}\) is the stratum of schools in the *h*th district of \({\mathcal {V}}_i\) and \(M_{i,h}\) is the number of schools in \({\mathcal {V}}_{i,h}\). Moreover, \({\mathcal {W}}_{i,h,j,g}\) denotes students of grade *g* of the *j*th school from district \({\mathcal {V}}_{i,h}\) and \(N_{i,h,j,g}\) denotes the number of students in \({\mathcal {W}}_{i,h,j,g}\). A sample \({\mathcal {S}}^{(I)}_{i,h}\) of \(m_{i,h}\) schools is drawn according to SRSWOR from \({\mathcal {V}}_{i,h}\), and then a sample \({\mathcal {S}}^{(II)}_{i,h,j,g}\) of \(n_{i,h,j,g}\) students is drawn by SRSWOR from \({\mathcal {W}}_{i,h,j,g}\) for each school *j* belonging to \({\mathcal {S}}^{(I)}_{i,h}\). Here and below in the formulas for variances, a single subscript *i* refers to region \({\mathcal {V}}_i\), a double subscript *i*, *h* refers to district \({\mathcal {V}}_{i,h}\), a triple subscript *i*, *h*, *j* refers to the *j*th school from district \({\mathcal {V}}_{i,h}\) and a quadruple subscript *i*, *h*, *j*, *g* refers to grade *g* of the *j*th school in district \({\mathcal {V}}_{i,h}\).

*h*th stratum of PSUs in the *i*th domain (we assume that it is constant within the stratum) and a single SSU from the *j*th PSU of the *h*th stratum of PSUs in the *i*th domain (we assume that it is constant within the PSU), respectively. Obviously, due to the randomness of \({\mathcal {S}}^{(I)}_{i,h}\), the actual cost is a random variable. In such a situation, when one wants to impose a constraint on the total cost, the standard approach is to impose a constraint on its expected variable cost (EVC), see, for example, Ch. 12.8.1 of Särndal et al. (1992), which in the case considered here assumes the form:

We want to find the allocation, that is, a set of two tables: a two-way table \({\underline{m}}=(m_{i,h})\) and a four-way table \({\underline{n}}=(n_{i,h,j,g})\), which give minimal domain-wise relative variances \(T_i\), \(i=1,\ldots ,I\), and minimal relative overall variance \({S}\), under the constraints (10) imposed through priority weights and the EVC constraint (9).

The result below says that this can be achieved by searching for a positive eigenvalue of a certain matrix based on population quantities and cost coefficients. The allocation is obtained from the respective eigenvector. The approach parallels earlier developments in this setting where, instead of using a single total average cost constraint, the first-stage and second-stage costs were treated separately. In particular, NW considered a two-stage scheme with separate constraints for the sizes of the PSUs and SSUs samples and with stratified sampling either at the first or at the second stage. As already mentioned, a similar problem has recently been investigated in WW for two-stage stratified SRSWOR at both stages as well as for a scheme with the stratified Hartley–Rao scheme at the first stage and stratified SRSWOR at the second stage (some variations of these two basic setups were also considered there). In that paper, again two constraints were jointly imposed: one for the cost incurred by the PSUs sample size, \(\sum _{i=1}^I\sum _{h=1}^{H_i}m_{i,h}=m\), and one for the cost generated by the expected SSUs sample size, \(\sum _{i=1}^I\,\sum _{h=1}^{H_i}\,\tfrac{m_{i,h}}{M_{i,h}}\,\sum _{j\in {\mathcal {V}}_{i,h}}\,\sum _{g=1}^{G_{i,h,j}}\,n_{i,h,j,g}=n\) (these formulas obviously refer to stratified SRSWOR at both stages). In the meantime, the eigenproblem approach has been further developed in a series of papers: Kozak (2004) considered a multivariate version of NW with an application to agricultural surveys, and Kozak and Zieliński (2005) adapted the original eigenproblem approach from NW, where it was assumed that relative variances are the same for all domains, to include priority weights for domains, and also gave an application to a real forestry survey. Only single-stage schemes were considered in both these papers. In the context we consider here, probably the most interesting paper is Kozak et al. (2008). These authors were concerned with two-stage sampling with stratification at the first stage together with a single cost constraint similar to (9) and domains-related constraints like (10). However, their approach was restricted to the case when SSU sample sizes are the same for all PSUs in a given stratum of a given domain. Also, they did not consider stratification at the second stage. The latter restriction does not seem to be as serious as the former.
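The expected SSUs sample size quoted above, \(\sum _{i}\sum _{h}\tfrac{m_{i,h}}{M_{i,h}}\sum _{j}\sum _{g}n_{i,h,j,g}\), uses the fact that under SRSWOR each PSU of \({\mathcal {V}}_{i,h}\) enters the first-stage sample with probability \(m_{i,h}/M_{i,h}\). The sketch below evaluates the formula and checks it against a brute-force average over all possible first-stage samples. The data layout (dictionaries keyed by the pair (i, h)) and all numbers are hypothetical.

```python
from itertools import combinations


def expected_ssu_sample_size(m, n):
    """Formula: sum over (i,h) of m_ih / M_ih times the sum of all n_ihjg.

    m[(i, h)] is the PSU sample size in stratum (i, h);
    n[(i, h)] is a list over PSUs j of lists over SSU strata g.
    """
    total = 0.0
    for key, psus in n.items():
        M_ih = len(psus)  # number of PSUs in stratum (i, h)
        total += m[key] / M_ih * sum(sum(per_g) for per_g in psus)
    return total


def brute_force_expectation(m, n):
    """Average realised SSU sample size over all SRSWOR first-stage samples."""
    total = 0.0
    for key, psus in n.items():
        sizes = [sum(per_g) for per_g in psus]
        samples = list(combinations(range(len(psus)), m[key]))
        total += sum(sum(sizes[j] for j in s) for s in samples) / len(samples)
    return total


# Hypothetical allocation: one domain with two PSU strata.
m = {(1, 1): 2, (1, 2): 1}
n = {(1, 1): [[3, 2], [4], [1, 1], [5]], (1, 2): [[2, 2], [6]]}

by_formula = expected_ssu_sample_size(m, n)
by_enumeration = brute_force_expectation(m, n)
```

Both computations agree, which is just the linearity of expectation applied to the inclusion indicators of the PSUs.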

In our main result below, we use the notation introduced earlier in this section.

### Theorem 2.1

Then, \(\lambda ^*\) is simple and unique and \({\underline{v}}^*\) has all coordinates of the same sign.

### Remark 2.1

### Remark 2.2

Note that the problem we solved in Theorem 2.1 can be formulated equivalently as: Minimize the overall relative variance \({S}\) under constraints (10) on relative variances \(T_i\) in domains (\(i=1,\ldots ,I\)) and the expected overall cost constraint (9). The reason for validity of such a rephrasing of the original problem is that \({S}=T\tfrac{1}{\tau ^2}\sum _{i=1}^I\,\kappa _i\tau _i^2\) which is a consequence of \(T_i=\kappa _i T\), \(i=1,\ldots ,I\).
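The identity for \({S}\) can be verified in one line. The sketch below assumes that the estimator of the population total is the sum of the domain estimators and that sampling is carried out independently across domains, so that variances add; the symbol \({\hat{\tau }}_i\) for the estimator of the domain total \(\tau _i\) is introduced here only for illustration.

$$\begin{aligned} {S}=\frac{D^2({\hat{\tau }})}{\tau ^2}=\frac{1}{\tau ^2}\sum _{i=1}^I\,D^2({\hat{\tau }}_i)=\frac{1}{\tau ^2}\sum _{i=1}^I\,T_i\,\tau _i^2=T\,\frac{1}{\tau ^2}\sum _{i=1}^I\,\kappa _i\,\tau _i^2. \end{aligned}$$

Since the factor multiplying \(T\) depends only on population quantities and the priority weights, minimizing \(T\) and minimizing \({S}\) under \(T_i=\kappa _iT\) lead to the same allocation.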

### Remark 2.3

The optimal allocation problem in two-stage sampling when no domains efficiency is taken into account has the well-known Neyman-type solution. For example, in the case of no stratification at both stages, such a solution under the EVC constraint is given in Ch. 12.8.1 of Särndal et al. (1992). Our formulas (11) and (12) reduce to (12.8.13) and (12.8.14) of Särndal et al. (1992) in the case when \(I=1\), \(H_1=1\) and \(G_{1,1,j}=1\), that is, when the whole population is a single domain and neither PSUs nor SSUs within a PSU are stratified. The optimal allocation in the case of a single domain with stratified SRSWOR for PSUs and SRSWOR for SSUs in every PSU from the first-stage sample is considered in Saini and Kumar (2015). The authors provide the NLP solution and then conclude that the same result can be obtained via a Neyman-type approach. Actually, they consider a *p*-variate case. However, their optimal allocation formulas (16) and (17) for \(p=1\) are again special cases of (11) and (12). Note that the assumption \(A_h>0\), needed also for the numerical solution in that paper, is (again for \(p=1\)) in full agreement with \(\gamma _{ih}^2>0\), which we assume in Theorem 2.1.

*U* consisting just of a single domain, i.e., when \(I=1\), the eigenvector cancels out from (11) and formulas (11) and (12) for optimal allocation are immediately reduced to (with the index \(i=1\) suppressed)

### Remark 2.4

The allocation results given in Theorem 2.1 should be compared to the domain-efficient allocation in the same stratified SRSWOR on both stages but with separate constraints for the size of the first-stage sample and for the expected size of the second-stage sample as given in Theorem 3.3 of WW. The basic difference is that in the latter paper both \(m_{i,h}\) and \(n_{i,h,j,g}\) depend on the eigenvector \({\underline{v}}^*\), while in the above result the eigenvector appears only in formula (11) for \(m_{i,h}\) and formula (12) is free from \({\underline{v}}^*\). This is the major, and by no means obvious, structural consequence of the fact that the constraint we consider here is imposed on the expected costs of the first and the second stage jointly.

### Proof of Theorem 2.1

*T* under constraints (9) and (10). Therefore, the Lagrange function has the form

The final part of the proof follows closely the argument given in WW and is recalled here just for the readers’ convenience.

Consider now the matrix \(\mathbf{D}\) and let \(\lambda ^*\) be its positive eigenvalue. To show that it is simple, unique and the eigenvector \({\underline{v}}^*\) attached to this eigenvalue has all coordinates of the same sign, we use the celebrated Perron–Frobenius theorem: If \(\mathbf{A}\) is a matrix with all strictly positive entries, then there exists a unique positive eigenvalue \(\nu \) of \(\mathbf{A}\); it is simple and such that \(\nu >|\lambda |\) for any other eigenvalue \(\lambda \) of \(\mathbf{A}\). The respective eigenvector (attached to \(\nu \) ) has all entries strictly positive (up to scalar multiplication)—see, for example, Kato (1981, Th. 7.3 in Ch. 1).
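The Perron–Frobenius statement can be illustrated numerically with power iteration, which for a matrix with strictly positive entries converges to the dominant eigenvalue and a strictly positive eigenvector. The matrix below is a hypothetical rank-one perturbation of a diagonal matrix, built from made-up vectors; it is not the matrix \(\mathbf{D}\) of Theorem 2.1, and power iteration is used here only as a convenient sketch, not as the procedure employed in the paper.

```python
def perron_pair(A, iterations=10_000, tol=1e-13):
    """Power iteration for a square matrix with strictly positive entries.

    Returns (eigenvalue, eigenvector) with the eigenvector scaled to sum to 1.
    """
    size = len(A)
    v = [1.0 / size] * size  # positive start vector whose entries sum to 1
    lam = 0.0
    for _ in range(iterations):
        w = [sum(A[r][c] * v[c] for c in range(size)) for r in range(size)]
        lam = sum(w)  # eigenvalue estimate, valid because sum(v) == 1
        w = [x / lam for x in w]
        converged = max(abs(x - y) for x, y in zip(w, v)) < tol
        v = w
        if converged:
            break
    return lam, v


# Hypothetical matrix: diag(d) + a b^T with positive d, a, b.
d = [1.0, 2.0, 3.0]
a = [1.0, 1.0, 2.0]
b = [0.5, 1.0, 1.0]
A = [[(d[r] if r == c else 0.0) + a[r] * b[c] for c in range(3)]
     for r in range(3)]

lam, v = perron_pair(A)
Av = [sum(A[r][c] * v[c] for c in range(3)) for r in range(3)]
residual = max(abs(x - lam * y) for x, y in zip(Av, v))
```

For larger problems, a general-purpose eigen-solver would normally be preferred; the point here is only that the returned eigenvalue is positive and all coordinates of the eigenvector share the same sign.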

Now formulas (11) and (12) follow directly from (16) and (15). \(\square \)

### Remark 2.5

Of course, as always when such allocation problems are solved without the natural box constraints \(m_{i,h}\le M_{i,h}\) and \(n_{i,h,j,g}\le N_{i,h,j,g}\) (and this is the case for the eigenproblem approach), the solution may violate some of them. Then, it is standard to set \(m_{i,h}=M_{i,h}\) and \(n_{i,h,j,g}=N_{i,h,j,g}\) in all instances of violation of the respective box constraint and then repeat the minimization procedure for the reduced population and the reduced cost constraint. This may produce solutions which are not optimal (though, typically, close to optimal). On the other hand, it is known, for example, in the case of the problem of optimal allocation in stratified SRSWOR, that it is possible to reduce the population since the optimal solution requires taking \(n_h=N_h\) in some strata. Then, minimization can be performed on such a reduced population; see, for example, Lemma 1 in Stenger and Gabler (2005). This approach has been developed by introducing box constraints to the numerical procedure of optimal allocation in Gabler et al. (2012); computational aspects of such procedures are analyzed in Münnich et al. (2012) (with further references given in that paper).
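The "fix the violating strata and re-allocate" step can be sketched for the simplest setting mentioned above, single-stage stratified SRSWOR with a fixed total sample size. The code is only an illustration of that standard device under the assumption of positive \(S_h\); it is not the multistage procedure of this paper, and the numbers are hypothetical.

```python
def neyman_with_upper_bounds(n, N, S):
    """Neyman allocation with take-all strata whenever n_h would exceed N_h.

    Strata violating n_h <= N_h are fixed at n_h = N_h and the remaining
    sample size is re-allocated among the other strata, until no violation
    is left. Assumes all S_h > 0.
    """
    alloc = [None] * len(N)
    free = list(range(len(N)))   # strata not yet fixed at their full size
    remaining = float(n)         # sample size still to be distributed
    while free:
        total = sum(N[h] * S[h] for h in free)
        trial = {h: remaining * N[h] * S[h] / total for h in free}
        violated = [h for h in free if trial[h] > N[h]]
        if not violated:
            for h in free:
                alloc[h] = trial[h]
            break
        for h in violated:
            alloc[h] = float(N[h])
            remaining -= N[h]
        free = [h for h in free if h not in violated]
    return alloc


# Hypothetical strata: the third one is small but highly variable.
N = [1000, 500, 20]
S = [2.0, 4.0, 100.0]
alloc = neyman_with_upper_bounds(120, N, S)
```

Here the unconstrained Neyman formula would assign 40 units to a stratum with only 20 units, so that stratum is taken in full and the remaining 100 units are split between the other two strata.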

We also do not consider here exact optimality with respect to integer solutions. In this context, it is worth mentioning again stratified SRSWOR, for which an integer-valued optimal allocation has been recently given in Wright (2017) and another one, even earlier, by Friedrich et al. (2015). Here, we are fully satisfied with, for example, random rounding of the non-integer allocation, which typically gives solutions close to optimal.
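One common way to implement random rounding is to round a value up with probability equal to its fractional part and down otherwise, so that the expected rounded value equals the original one. The sketch below follows this convention, which is an assumption, since the text does not specify the rounding rule.

```python
import math
import random


def random_round(x, rng):
    """Round x up with probability frac(x), down otherwise (unbiased)."""
    lower = math.floor(x)
    return lower + (1 if rng.random() < x - lower else 0)


rng = random.Random(2024)            # fixed seed for reproducibility
allocation = [12.3, 7.0, 40.75]      # hypothetical non-integer allocation
rounded = [random_round(x, rng) for x in allocation]

# Empirical check of unbiasedness for a single value.
draws = [random_round(3.3, rng) for _ in range(20000)]
mean_of_draws = sum(draws) / len(draws)
```

Each rounded value is one of the two neighbouring integers, and integer inputs are left unchanged. Note that independent rounding of each cell does not preserve the total exactly.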

## 3 Special cases

### 3.1 Stratification only at the first stage

*i*, *h*, *j*) and thus it allows for a considerable simplification of the notation used in Sect. 2. The constraints imposed by priority weights for relative variances in domains assume the form

*j*th PSU from \({\mathcal {V}}_{i,h}\): The number of SSUs is \(N_{i,h,j}\), the population variance among SSUs is \(S_{i,h,j}^2\) and the sample size is \(n_{i,h,j}\). Here, \(D_{i,h}^2\) and \(M_{i,h}\) have the same definition as in Sect. 2.

After computing the eigenvector \({\underline{v}}:=\texttt {v}\) as given in the last line of the R-code above, one can calculate the optimal sample sizes \(m_{i,h}\) and \(n_{i,h,j}\) according to (18) and (20), respectively.

The R-code given above was adapted from the full R-code as given in https://github.com/rwieczor/eigenproblem_sample_allocation, which was created (in connection with WW) for optimal fixed precision allocation in subpopulations in two-stage sampling with the stratified Hartley–Rao \(\pi \)ps scheme at the first stage and SRSWOR at the second stage and with constraints imposed separately on the size of the sample at the first and on the expected size of the sample at the second stage.

### 3.2 Stratification only at the second stage

*i*. This also allows us to simplify the notation of Sect. 2. The constraints imposed by priority weights for relative variances in domains assume the form

*g*th SSU stratum of the *j*th PSU from \({\mathcal {V}}_i\). Moreover, \(G_{i,j}\) is the number of SSU strata in the *j*th PSU from \({\mathcal {V}}_i\), \(M_i=\#({\mathcal {V}}_i)\) and

### 3.3 No stratification at stage one and two

*i*, *h*, *j*). In this case, the formulas are further simplified. The constraints imposed by priority weights for relative variances in domains assume the form

*j*th PSU from \({\mathcal {V}}_i\). Moreover,

## 4 Two-stage sampling with \(\mathrm {pps}\) sampling

### 4.1 \(\mathrm {pps}\) Sampling at the first stage and SRSWOR at the second stage

We draw the ordered sample of PSUs \({\mathcal {S}}^{(I)}=(K_1,\ldots ,K_m)\) using \(\mathrm {pps}\) sampling, meaning that PSUs are drawn *m* times with replacement (that is, independently), the *j*th with probability \(p_j\) which is proportional to its size, \(j\in {\mathcal {V}}\) (population of PSUs). Then, if the *j*th PSU belongs to \({\mathcal {S}}^{(I)}\), we draw from it (by SRSWOR) a sample of SSUs of size \(n_j\), obtaining in this way the sample \({\mathcal {S}}^{(II)}_j\), \(j\in {\mathcal {S}}^{(I)}\). Such a sampling scheme is considered in Ch. 4.5 of Särndal et al. (1992) (in particular, in Result 4.5.1 the unbiased estimator and its variance are given). A population-efficient allocation procedure for this setup has been given recently in Valliant et al. (2015) as one of the options in the PracTools R package. The importance of this scheme is due to the fact that when the sample of PSUs is small relative to the number of PSUs in the population, sampling with or without replacement gives nearly the same results. Consequently, very often in practice, the first-stage variance in \(\pi \mathrm {ps}\) without replacement sampling is approximated by its \(\mathrm {pps}\) version. It appears that in such a case the eigenproblem methodology we develop here allows for a closed analytic formula for the eigenvector responsible for the domains-efficient allocation. It follows from the fact that the respective population matrix is of rank one. The details are given below.
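The sampling scheme can be sketched as follows: PSUs are drawn independently with probabilities proportional to size, an SRSWOR subsample is taken at each draw, and the estimated PSU totals are divided by \(p_j\) and averaged over the draws. This is a common form of the with-replacement estimator for such designs; whether it matches Result 4.5.1 of Särndal et al. (1992) in every detail is not verified here. The function name, the use of the number of SSUs as the size measure and the data are hypothetical.

```python
import random


def pps_two_stage_estimate(psus, m, n_per_draw, rng):
    """Estimate the population total under pps-with-replacement / SRSWOR.

    psus: list of lists, psus[j] holds the y-values of the SSUs in PSU j.
    m: number of independent pps draws of PSUs.
    n_per_draw: SSU sample size per drawn PSU (capped at the PSU size).
    """
    sizes = [len(y) for y in psus]
    p = [s / sum(sizes) for s in sizes]          # p_j proportional to size
    draws = rng.choices(range(len(psus)), weights=p, k=m)
    estimate = 0.0
    for j in draws:
        sample = rng.sample(psus[j], min(n_per_draw, sizes[j]))  # SRSWOR
        t_j_hat = sizes[j] * sum(sample) / len(sample)           # PSU total
        estimate += t_j_hat / p[j]
    return estimate / m


rng = random.Random(7)
# Sanity check: with a constant variable every draw reproduces the total.
constant_psus = [[2.0] * 5, [2.0] * 10, [2.0] * 3]
estimate_constant = pps_two_stage_estimate(constant_psus, m=4, n_per_draw=2, rng=rng)
true_total = sum(sum(y) for y in constant_psus)
```

With a constant variable, every term \({\hat{t}}_j/p_j\) equals the population total, which gives a simple deterministic check of the weighting.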

*j*. Its variance has the form

*C* is the total expected cost of the survey, \(c_{I,i}\) is the cost incurred by a PSU from \(V_i\) (assumed to be constant within the domain) and \(c_{II,i,j}\) is the cost incurred by an SSU belonging to the *j*th PSU from the *i*th domain.

This setting is somewhat different from, and actually simpler than, those considered earlier. This is due to the fact that in the expression for \(T_i\) all summands are multiplied by \(1/m_i\).

### Theorem 4.1

### Proof

*T*. Since the matrix \(\mathbf{D}\) is positive semidefinite of rank 1, the number *T* is its only nonzero eigenvalue, and this eigenvalue is positive and simple. Moreover, note that \({\underline{v}}^*:=\sqrt{C}{\underline{a}}\) is the eigenvector of \(\mathbf{D}\) associated with the eigenvalue \(\Vert {\underline{a}}\Vert ^2/C\). Finally, from (22), we obtain

### 4.2 SRSWOR at the first stage and pps sampling at the second stage

For completeness of the picture for two-stage sampling involving pps approach, let us consider the situation when the PSUs sample \({\mathcal {S}}^{(I)}\) is drawn through SRSWOR and the SSUs sample by sampling with replacement with probabilities \(p_k\) proportional to the size of *k*th unit. Here, the simplification of Sect. 4.1 is no longer available. This case falls under the general framework developed in Sect. 3.3.

*m* is the number of PSUs drawn by SRSWOR from the total of *M* PSUs in the population, \(n_j\) is the number of “with-replacement” draws from the *j*th PSU, \(K_{j,\ell }\) is the SSU drawn from the *j*th PSU in the \(\ell \)th draw (with replacement), \(j\in {\mathcal {V}}\) (PSUs population of size *M*). Evidently, \({\hat{t}}\) is unbiased for the population total. Its variance is

*C* is the total expected cost of the survey, \(c_{I,i}\) is the cost incurred by a PSU from \(V_i\) (assumed to be constant within the domain) and \(c_{II,i,j}\) is the cost incurred by an SSU belonging to the *j*th PSU from the *i*th domain.

## 5 Three-stage sampling without stratification

In multistage sampling, typically, we do not go beyond three-stage sampling. This scheme is described in detail, for example, in Särndal et al. (1992, Ch. 4.4.2). The optimal allocation of the sample between three stages under the cost constraints, with the additional simplifying assumption that the sizes of SSU and TSU (tertiary sampling unit) samples do not depend on PSU or SSU, respectively, had been studied already in Cochran (1977, Ch. 10.8) (see also Singh 2003, Ch. 10.4). Recently, the optimal allocation procedure, using a simplified variance formula with the standard constraints regarding the total costs, was designed in Valliant et al. (2015) as a part of their PracTools R package. An application of such a simple three-stage sampling design is given, for example, in Tate and Hudgens (2007).

In this section, we analyze the eigenproblem approach to the domain-efficient allocation of sample in three-stage sampling, but first we recall the Neyman-type optimal allocation in the case of no domains.

*U* under three-stage sampling with SRSWOR at every stage has the form

*L* and \(\ell \) denote the number of PSUs, \(M_j\) and \(m_j\) the number of SSUs in the *j*th PSU, \(N_{j,k}\) and \(n_{j,k}\) the number of TSUs in the (*j*, *k*)th SSU, in the population and in the sample, respectively; moreover, \(S_I^2\), \(S_{II,j}^2\), \(S_{III,j,k}^2\) denote population variances for PSUs in *U*, SSUs in the *j*th PSU and TSUs in the (*j*, *k*)th SSU.

*j*th PSU and each TSU belonging to the *k*th SSU from the *j*th PSU of the *i*th subpopulation, while *C* denotes the overall cost of the survey, obtained through the standard Neyman approach leads to the following optimal allocation solution

*T* is unknown and has to be minimized under an additional total EVC constraint which in the case of three-stage sampling assumes the form

*i*th subpopulation, each SSU belonging to the *j*th PSU of the *i*th subpopulation and each TSU belonging to the *k*th SSU from the *j*th PSU of the *i*th subpopulation, while *C* denotes the overall cost of the survey.

*F* with respect to:

1. \(\ell _i\) we get $$\begin{aligned}&-\tfrac{\lambda _i}{\rho _i^2\ell _i^2}\left( \gamma _i^2+L_i\sum _{j=1}^{L_i}\,\tfrac{1}{m_{i,j}}\left( \beta _{i,j}^2 +M_{i,j}\,\sum _{k=1}^{M_{i,j}}\,\tfrac{\delta _{i,j,k}^2}{n_{i,j,k}}\right) \right) \nonumber \\&\qquad +\,\mu \left( c_{I,i}^2+\tfrac{1}{L_i}\sum _{j=1}^{L_i}m_{i,j}\,\left( c_{II,i,j}^2+\tfrac{1}{M_{i,j}}\sum _{k=1}^{M_{i,j}}\,c_{III,i,j,k}^2n_{i,j,k}\right) \right) =0,\qquad \quad \end{aligned}$$ (30)
2. \(m_{i,j}\) we get $$\begin{aligned}&-\tfrac{\lambda _iL_i}{\rho _i^2\ell _i m_{i,j}^2}\left( \beta _{i,j}^2+M_{i,j}\,\sum _{k=1}^{M_{i,j}}\,\tfrac{\delta _{i,j,k}^2}{n_{i,j,k}}\right) \nonumber \\&\qquad +\,\mu \tfrac{\ell _i}{L_i}\left( c_{II,i,j}^2 +\tfrac{1}{M_{i,j}}\sum _{k=1}^{M_{i,j}}\,c_{III,i,j,k}^2n_{i,j,k}\right) =0, \end{aligned}$$ (31)
3. \(n_{i,j,k}\) we get $$\begin{aligned} -\tfrac{\lambda _iL_iM_{i,j}}{\rho _i^2\ell _im_{i,j}n_{i,j,k}^2}\delta _{i,j,k}^2+\mu \tfrac{\ell _i}{L_i}\, \tfrac{m_{i,j}}{M_{i,j}}\,c_{III,i,j,k}^2=0. \end{aligned}$$ (32)
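As an intermediate step, Eq. (32) can be solved directly for \(n_{i,j,k}\). The rearrangement below is a sketch which assumes that \(\lambda _i\), \(\mu \), \(\rho _i\), \(\delta _{i,j,k}\) and \(c_{III,i,j,k}\) are all strictly positive, so that the positive square root can be taken:

$$\begin{aligned} n_{i,j,k}^2=\frac{\lambda _i}{\mu }\,\frac{L_i^2M_{i,j}^2}{\ell _i^2m_{i,j}^2}\,\frac{\delta _{i,j,k}^2}{\rho _i^2\,c_{III,i,j,k}^2}, \qquad \text {that is,}\qquad \frac{\ell _i}{L_i}\,\frac{m_{i,j}}{M_{i,j}}\,n_{i,j,k}=\sqrt{\frac{\lambda _i}{\mu }}\;\frac{\delta _{i,j,k}}{\rho _i\,c_{III,i,j,k}}. \end{aligned}$$

Under these assumptions, the third-stage sample size, scaled by the first- and second-stage sampling fractions, is proportional to \(\delta _{i,j,k}/c_{III,i,j,k}\) within the *i*th subpopulation, which is the familiar Neyman-type pattern.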

### Theorem 5.1

*T* the base of the relative variance has the form

### Remark 5.1

Note that in the case of no domains, i.e., when \(I=1\), the allocation formulas (37)–(39) as well as the formula for the optimal variance (40) are simplified to the Neyman-type allocation and optimal variance formulas as given in (24)–(26) and (27), respectively.

Note also that only the allocation of the first-stage sample and the optimal base of the variance depend on the eigenvector \({\underline{v}}^*\). Formulas (38) and (39) for the allocation of the second- and third-stage samples are given directly in terms of population quantities with no reference to the eigenvector \({\underline{v}}^*\).

## 6 Conclusions

In this paper, we search for Neyman-type solutions to domains-efficient allocation in multistage stratified sampling. Such a solution can be seen as an alternative to the purely numerical one proposed in CRH for a stratified single-stage scheme. We develop the eigenproblem method originating in NW and use eigenvalues and eigenvectors to obtain an allocation which, under specified priority coefficients for the constraints on the domain relative variances, assures optimal estimation both in the whole population and in the domains. In particular, we consider two- and three-stage sampling. The solutions we provide here are novel, relative to what is known for the eigenproblem approach to domains-efficient allocation, in several respects. The most important is that, in contrast to earlier work such as WW, a single total cost constraint is taken into account. In previous papers, two constraints, on the (expected) sample sizes of the PSUs and of the SSUs, respectively, were instead imposed jointly. Those papers considered two-stage sampling with SRSWOR (or Hartley–Rao) schemes with stratification either at the first or the second stage. Here, we apply the eigenproblem methodology also to new sampling schemes: stratified SRSWOR at both stages, as well as \(\mathrm {pps}\) sampling with replacement and SRSWOR either at the first or the second stage, and to three-stage sampling with SRSWOR at each stage. In each of these cases, the allocation which assures optimality (under given domain priority weights) of the estimators of domain totals is given in terms of an eigenvector of a population-dependent matrix (which typically is a rank-one perturbation of a diagonal matrix). Moreover, the standard errors of the estimates in the domains and in the whole population are given in terms of the respective eigenvalue. The latter allows one to interpret the solution as a direct generalization of Neyman-type optimal allocation to the multi-domain case.

Another important consequence of the approach we use here is that the analytic formulas we obtained reveal the structure of the optimal allocation. For example, only the first-stage optimal allocation is influenced by the eigenvector \({\underline{v}}^*\) of the population matrix \(\mathbf{D}\).

## Notes

### Acknowledgements

We are very thankful to two anonymous referees whose remarks allowed us to improve the presentation of the paper. We are also grateful to R. Münnich for interesting discussions on different aspects of optimal allocation problems. Thanks to R. Wieczorkowski for help with computations regarding the example in the Appendix.

### Open Access

This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

## References

- Ballin, M., Barcaroli, G.: Joint determination of optimal stratification and sample allocation using genetic algorithm. Survey Methodol. **39**(2), 369–393 (2013)
- Choudhry, G.H., Rao, J.N.K., Hidiroglou, M.A.: On sample allocation for efficient domain estimation. Survey Methodol. **38**(1), 23–29 (2012)
- Clark, R.G.: Sampling of subpopulations in two stage surveys. Stat. Med. **28**(29), 3697–3717 (2009)
- Clark, R.G., Steel, D.G.: Optimum allocation of sample to strata and stages with simple additional constraints. J. R. Stat. Soc. D **49**, 197–207 (2000)
- Cochran, W.G.: Sampling Techniques, 3rd edn. Wiley, New York (1977)
- Costa, A., Satorra, A., Ventura, E.: Using composite estimators to improve both domain and total area estimation. SORT **19**, 69–86 (2004)
- Friedrich, U., Münnich, R., de Vries, S., Wagner, M.: Fast integer-valued algorithm for optimal allocations under constraints in stratified sampling. Comput. Stat. Data Anal. **92**, 1–12 (2015)
- Friedrich, U., Münnich, R., Rupp, M.: Multivariate optimal allocation with box-constraints. Aust. J. Stat. **47**, 33–52 (2018)
- Gabler, S., Ganninger, M., Münnich, R.: Optimal allocation of the sample size to strata under box constraints. Metrika **75**(2), 151–161 (2012)
- Kato, T.: A Short Introduction to Perturbation Theory for Linear Operators. Springer, New York (1981)
- Keto, M., Pahkinen, E.: Sample allocation for efficient model-based small area estimation. Survey Methodol. **43**(1), 93–106 (2017)
- Khan, M.G.M., Maiti, T., Ahsan, M.J.: An optimal multivariate stratified sampling design using auxiliary information: an integer solution using goal programming approach. J. Off. Stat. **26**(4), 695–708 (2010)
- Kozak, M.: Method of multivariate sample allocation in agricultural surveys. Biom. Colloq. **34**, 241–250 (2004)
- Kozak, M., Zieliński, A.: Sample allocation between domains and strata. Int. J. Appl. Math. Stat. **3**, 19–40 (2005)
- Kozak, M., Zieliński, A., Singh, S.: Stratified two-stage sampling in domains: sample allocation between domains, strata and sampling stages. Stat. Probab. Lett. **78**, 970–974 (2008)
- Lednicki, B., Wesołowski, J.: Localization of sample between subpopulations. Wiad. Statyst. **39**(9), 2–4 (1994). (in Polish)
- Lednicki, B., Wieczorkowski, R.: Optimal stratification and sample allocation between subpopulations and strata. Stat. Trans. **6**(2), 287–305 (2003)
- Longford, N.T.: Sample size calculation for small-area estimation. Survey Methodol. **32**, 87–96 (2006)
- Molefe, W., Clark, R.G.: Model-assisted optimal allocation for planned domains using composite estimation. Survey Methodol. **41**(2), 377–387 (2015)
- Münnich, R., Sachs, E.W., Wagner, M.: Numerical solution to optimal allocation problems in stratified sampling under box constraints. Adv. Stat. Anal. **96**(3), 435–450 (2012)
- Niemiro, W., Wesołowski, J.: Fixed precision allocation in two-stage sampling. Appl. Math. **28**, 73–82 (2001)
- Saini, M., Kumar, A.: Optimum allocation in stratified two stage design by using double sampling for multivariate surveys. Probab. Stat. Forum **8**, 19–23 (2015)
- Särndal, C.-E., Swensson, B., Wretman, J.: Model Assisted Survey Sampling. Springer, New York (1992)
- Singh, S.: Advanced Sampling Theory with Applications. Kluwer, Dordrecht (2003)
- Stenger, H., Gabler, S.: Combining random sampling and census strategies: justification of inclusion probabilities equal to 1. Metrika **61**, 137–156 (2005)
- Tate, J.I., Hudgens, M.G.: Estimating population size with two- and three-stage sampling designs. Am. J. Epidemiol. **165**(11), 1314–1320 (2007)
- Valliant, R., Dever, J.A., Kreuter, F.: PracTools: computations for design of finite population samples. R J. **7**(2), 163–176 (2015)
- Valliant, R., Dever, J.A., Kreuter, F.: Practical Tools for Designing and Weighting Survey Samples. Springer, Berlin (2013)
- Wesołowski, J., Wieczorkowski, R.: An eigenproblem approach to optimal equal-precision sample allocation in subpopulations. Commun. Stat. Theory Methods **46**(5), 2212–2231 (2017)
- Wright, T.: Exact optimal sample allocation: more efficient than Neyman. Stat. Probab. Lett. **129**, 50–57 (2017)