Enhanced Cube Implementation For Highly Stratified Population

A balanced sampling design should always be the adopted strategies if auxiliary information is available. Besides, integrating a stratified structure of the population in the sampling process can considerably reduce the variance of the estimators. We propose here a new method to handle the selection of a balanced sample in a highly stratified population. The method improves substantially the commonly used sampling design and reduces the time-consuming problem that could arise if inclusion probabilities within strata do not sum to an integer.


Introduction
In survey statistics, balanced sampling is a particularly efficient method when values of auxiliary variables are available for all units in the population. The idea is to select the sample so that the totals of the Horvitz-Thompson estimators of some auxiliary variables equal the population totals. There are different methods for selecting a balanced sample. Deville and Tillé (2004) have proposed the cube method which successively transforms the vector of inclusion probabilities into a sample. The method has been improved by Chauvet and Tillé (2006) by reducing the computation time.
In many areas, it is very useful to use stratified sampling designs. As already indicated by Neyman (1934), the variance of the Horvitz-Thompson estimator can be reduced by constructing strata such that the variables are homogeneous within the strata. Besides, Chauvet (2009) proposed a specific algorithm to obtain balanced samples in the strata of a population. However, this method becomes cumbersome when the number of strata is large.
A highly stratified population is very common in survey sampling. For example, it may be necessary to select individuals from a population while requiring that at most only one individual from each household in a population is taken. Each household is then a stratum. In spatial statistics, one can also construct small strata of neighboring units to obtain well-spread samples. Highly stratified sampling is also necessary for some donor imputation methods: the objective is to select a respondent for each non-respondent to impute its values. Each nonrespondent then defines a stratum from which a respondent is to be selected (Hasler and Tillé, 2016).
The balanced and stratified sampling method of Chauvet (2009) has been improved by Hasler and Tillé (2014) to partially resolve the disadvantage of the time required to process a highly stratified population. When the sum of the inclusion probabilities in the strata is not an integer, the computation time can become problematic. This problem arises, for example, when the objective is to select less than one individual per household. Neither of the two methods already proposed solves the computational time problem in these situations.
In this paper, we propose a new method to obtain a stratified balanced sample. This new method is particularly interesting when the population is highly stratified and the inclusion probabilities do not sum to an integer within the strata. We refer readers to Tillé (2020) and Hankin et al. (2019) to have more information on the general settings on stratified balanced sampling design.
The document is organized as follows. The section 2 gives the basic notations and settings. Section 3 present the problem of selecting a balanced sample. In the section 4, we review the cube method and how it is used to select a balanced sample. In section 5, we discuss the issue of the highly stratified population and review the methods used to select a sample in this case. In the section 6, we present the new method and the section 7 is devoted to variance estimation. In the section 8, we give the simulation results of the different algorithms on an artificial dataset while the section 9 gives a conclusion on the new method.

Basic sampling notations
Consider a finite population U of size N whose units can be defined by labels k ∈ {1, 2, . . . , N }. Let define a variable of interest y. Suppose that we are trying to estimate the following unknown total: A sampling design is defined by the probability p(s) of selecting each possible subset s ⊂ U such that s⊂U p(s) = 1. Consider a vector a = (a 1 , . . . , a N ) that maps elements of a subset s to an N vector of 0s and 1s such that: For each unit of the population, the inclusion probability π k , with 0 ≤ π k ≤ 1, is defined as the probability of selecting k into a sample s: Let π = (π 1 , . . . , π N ) be the vector of all the inclusion probabilities. Let also π k be the probability of selecting units k and together in the sample, with π kk = π k . Assuming that π k > 0 for all k ∈ U , the total (1) can be estimated using the classical unbiased Horvitz-Thompson estimator defined by (2)

Stratified balanced sampling
Usually, some auxiliary information are available for each unit k ∈ U in a vector x k = (x k1 , . . . , x kq ) ∈ R q , with q ∈ N * . A sampling design is said to be balanced on the q auxiliary variables if and only if it satisfies the following balancing equation: Sometimes, selecting a sample that satisfies exactly the constraints is not possible due to the rounding problem. In many applications, inclusion probabilities are such that the selected sample has a fixed size. In order to obtain a sampling design with fixed sample size, a linear combination of the auxiliary variables must be proportional or equal to the vector of inclusion probabilities, i.e. there exists ψ ∈ R q such that ψ x k = π k , for all k ∈ U . Indeed, this gives The size of the sample will be fixed only if n is an integer. If it is not the case, the sample size will be equal to the higher or lower integer to n. More generally, the problem of selecting a balanced sample is written as the following linear system : where A = (x 1 /π 1 , . . . , x N /π N ) . The aim consists then of obtaining a sample a that satisfies (or approximately satisfies) the constraints. Suppose that the population U is divided into H strata U 1 , . . . , U H , with respective sizes of N 1 , . . . , N H . The strata form a partition and respect the following properties: Then, this implies that H h=1 N h = N . The inclusion probabilities sum to a value n h in each stratum h, i.e. n h = k∈U h π k . Let h = (h 1 , . . . , h N ) be a categorical vector that specifies the stratum to which each unit belongs. For example, h k = means that unit k belongs to strata U , with k ∈ U and ∈ {1, . . . , H}. Another way for expressing the stratum of each unit is to use the disjunctive form. Let H be the disjunctive matrix of the corresponding vector h of size N × H, such that: where 1(U h ) ∈ R N is an column vector such that its kth element is equal to 1 if the unit k belongs to the stratum U h and 0 otherwise.
Obtaining a balanced sample in a stratified population is equivalent to adding stratification constraints to the previous linear system (3). These constraints are contained in the matrix H, so the modification of the linear problem gives: The number of constraints in the linear problem is then (q + H). In the next section, a method to select a balanced sample is presented.
4 Cube Method Deville and Tillé (2004) developed the cube method that selects a balanced sample respecting the inclusion probabilities. The method can deal with equal or unequal inclusion probabilities. The algorithm is separated into two phases.
• The first phase is called the flight phase. It modifies recursively and randomly the vector of inclusion probabilities π into a sample by respecting exactly the balancing constraints of the problem. The subspace induced by the linear system (3) could be rewritten using the following notation: The idea is then to use a vector u of the null space of A in order to update randomly the vector π. The whole procedure of the update can be found in Deville and Tillé (2004). At each step, at least one component is set to 0 or 1. Matrix A is updated with the new inclusion probabilities. This step is repeated until the null space of A is empty. At the end of the flight phase, the final updated vector of π contains at most q elements that are still not equal to 0 or 1.
• The second phase is called the landing phase. This phase allows to obtain the sample a that respects as much as possible the balancing constraints. There are two different ways to achieve it, by relaxing the q constraints one by one, or by linear programming.
In the flight phase, the major computational cost comes from the research of a vector in the null space of A . Chauvet and Tillé (2006) have improved this timeconsuming inconvenience using a sub-matrix of A rather than the entire matrix. The idea is to consider a submatrix that has one more row than the number of columns to ensure to have at least one vector in its null space. This submatrix, denoted by B, has then a size of (q+1)×q, with respect to q < N and Rank(B) ≤ q.
The interest of using this submatrix comes from the following result: a vector u of Null(B ) completed by (N − (q + 1)) zeros is a vector of Null(A ). With this idea, all the computations can be done using only a submatrix B. Usually, N is much greater than q, the size of B is then much smaller than A. This implies obviously an important gain of computational time. The method proposed in this paper uses the same idea. In the next section, the particular case of highly stratified sampling is considered.

Highly stratified population
It is always preferable to consider a stratified population in order to estimate the total (1). Indeed, the variance of the Horvitz-Thompson estimator (2) can be considerably reduced compared to the non-stratified estimator (1). However, when the population is highly stratified (i.e. H is very large), the selection of a balanced sample with classical methods becomes difficult due to the too large number of constraints in H. In order to decrease the time-consuming problem, different approaches have already been proposed. Chauvet (2009) has developed an algorithm to select a balanced sample in a highly stratified population. Firstly, a flight phase is applied inside each stratum. This allows modifying the inclusion probabilities such that these are as balanced as possible in each stratum. Next, a flight phase is applied on the whole population. Finally, a landing phase is carried out on units that are not still selected or rejected. This procedure has the advantage to be simple to implement. Its major deficiency is when the number of strata H becomes too large, the procedure remains very slow and often cannot even be used. Hasler and Tillé (2014) have proposed another method to deal with highly stratified population. As the previous method, it begins by applying the flight phase of the cube method to each stratum of the population. Next, it carries out a flight phase on an union of strata by adding another stratum at each step. By doing this, strata are managed one after the other and the inclusion probabilities of certain strata are set to 0 or 1 during this step. The idea behind this procedure is to reduce the matrix H considered because some strata are removed from the matrix when all its units are selected or rejected. At the end, a landing phase is applied. However, if n h is not equal to an integer for a stratum U h , this method also remains very time-consuming. Indeed, some strata are never completely removed during the procedure and then the submatrix of H considered becomes too large.
The properties of the cube method imply that the inclusion probabilities are satisfied and that the sample is balanced on the auxiliary variables in these two methods. However, they still have difficulty to deal with all the situations of highly stratified sampling. In the next following section, a new method is presented in order to completely resolve these drawbacks.

Proposed method
In the fast implementation of the cube method (Chauvet and Tillé, 2006), the main modification was to use a matrix smaller than A to update π. This allows to considerably decreasing the computational cost. The idea of our method is similar but adapted to a stratified population: consider a matrix of constraints B smaller than (H A) during the use of the cube method.
The submatrix matrix B must be found at each step of the flight phase of the cube method. As explained in Section 3, the number of balancing constraints depends on the number of strata H when the population is stratified. By considering a matrix B with fewer rows, or units, the corresponding vector of strata h will be reduced. This subvector of h will contain fewer categories and then the corresponding matrix H will have fewer columns. The number of constraints will therefore depend on the rows of B. This is why obtaining the matrix B with exactly one row more than its number of columns is not as easy as with an unstratified population. Algorithm 1 shows how to find the number of rows to consider to obtain the smaller matrix B such that B has exactly one row more than its number of columns.
Algorithm 1 Find the sub-matrix B of (H A) Let q be the number of auxiliary variables of A. Initialize q 1 by q. For t = 1, 2, 3, . . . repeat the following steps: 1. Extract the first q t rows of the vector h and denote it h t .

Denote H t the number of different strata in
Finally, B is defined as the q t first rows of the concatenated matrix (H t A t ), where A t and H t are the submatrix containing only its q t first rows.
The matrix B is found after having computed its number of rows q t using Algorithm 1. The first q t elements of h composed the strata membership vector h t . The disjunctive matrix H t can then be found using h t . The matrix B is equal to (H t A t ), with A t the submatrix of A containing only its q t first rows. The same procedure proposed by Chauvet and Tillé (2006) can then be applied. If the population is highly stratified and the number of auxiliary variables is acceptable, our procedure can be very efficient. Moreover, it handles inclusion probabilities that do not sum to an integer inside strata. Algorithm 2 presents the whole method.

Variance estimation
The variance can be approximated using the method proposed by Deville and Tillé (2005). Let the vector where (H A) k denote the kth row of the matrix (H A). The variance of the Horvitz-Thompson estimator of the total Y can be approximated by where There exists many different ways to express the quantity c k and then this leads to various approximations of the variance. Value c k can in particular be approximated by c k = (1 − π k ) n n − (H + q) .
Equation (5) can be estimated on a sample s using c k instead of c k and by replacing the sums on U by sums on s in Expressions (5) and (6).

Simulations
In this section, the performance of the method is evaluated on real data produced by the Swiss Federal Statistical Office (2020). The dataset contains information on Swiss establishments. We restrict the study to the Switzerland region called Espace Mittelland (a region of the second degree of the Nomenclature of Territorial Units for Statistics (NUTS) of Switzerland). This region contains 5 cantons (a region of the third degree of the NUTS) and 675 municipalities. For confidentiality reasons,
I. Perform a flight phase on each stratum according to the inclusion probabilities π and the balancing constraints in A . The vector π is updated by π 1 such that some of its elements are set to 0 or 1. Compute the set of indices i 1 ⊂ {1, . . . , N } containing the unit indices that have an inclusion probability still not equal to 0 or 1.
II. Initialize t by 1. Repeat step 1. to 6. until it is no more possible to find the matrix B or until the vector u is null.
1. In A, h and π, consider only units with indices in i t .
2. Apply the Algorithm 1 to find the submatrix B of (H A).
3. Compute u, a vector of the null space of B completed by 0s to obtain a vector with the same size as i t .
6. Update t by t + 1 and update i t the set of indices containing the unit that have an inclusion probability still not equal to 0 or 1.
III. It could remain some units that are still not rejected or selected. Perform a landing phase by suppression of variables on the balancing variables (H A) on the remaining indices i t .
the units considered are the hectares of land in which at least one establishment is located. In order to be able to estimate the variance, only 3 hectares of land per municipalities are included in the study. This implies that the dataset contains information from 2025 hectares including at least one establishment. We stratify the units in two different ways: by cantons and by municipalities. The number of strata is then respectively equal to H c = 5 and H m = 675. Figure  1 shows the dataset with the two proposed stratification. The idea behind this procedure is to compare the execution time for a stratified population with a low number of strata versus a high one. To compare the method, we will use balancing variables x j containing the number of women employed in a sector j, with j = 1, 2, 3. Each sector represents a type of activity: sector j = 1 involves the natural environment and agriculture, sector j = 2 is manufacture and sector j = 3 is related to services. Table 1 shows the mean time execution of three methods for highly stratified sampling: the methods of Hasler and Tillé (2014) and Chauvet (2009), detailed in Section 5, and the new one presented in this article.
Inside each stratum, the inclusion probabilities are equal. For each stratification, we carry out two different sampling: one with inclusion probabilities that sum to an integer number within each stratum and one with a non-integer sum. Then, we consider n h = 80 and n h = 80.4, h ∈ {1, . . . , H c }, for the first stratification. For the second one, we take n h = 2 and n h = 1.4, h ∈ {1, . . . , H m }. We choose to deal also with non-integers n h with the aim to compare the impact of this situation on the mean sampling time.
Chauvet's method cannot be compared because its execution time is too long and should be avoided for highly stratified population. However, it remains very efficient if the number of strata is acceptable. If inclusion probabilities in stratum sum to an integer, the Hasler's method performs very well. However, the execution time increases strongly when n h is not integer. The proposed method has a very well behaved and the time is considerably reduced for a highly stratified population.
In order to compare the variance of the method with the others, we estimate the variance using some variables of interest y j that contains the total number of employee of the sector j, j = 1, 2, 3. In Table 1, we compare the approximated variance (5), the estimated variance and the simulated variance computing using the equation: where m is the number of simulations. For each method, we vary the number of selected units within each stratum by taking n h equal to 2 for the stratification with H c strata and 80 for the stratification with H m strata. This implies sample of size respectively equal to 400 and 1350. The variance estimator seems to be unbiased for the approximated variance. However, we see that the approximated variance and estimator are slightly biased to the v i . This coming from the landing phase of each method. We can conclude that the proposed method is comparable in terms of variance to other methods.

Conclusion
The stratified sampling procedure is a well-known and appropriate procedure to reduce the variance of the Horvitz-Thompson estimator. In this paper, we propose a new method and implementation that provides an excellent executive time and flexibility that the existing methods did not allow. In many surveys where the population is stratified, the sum of inclusion probabilities within each stratum is not an integer. Other methods are not directly applicable in this case. We have shown by mean of simulations that the variance of the estimator is not impacted by our method. All of these results indicate that our proposed algorithm is very efficient to select a sample in a stratified and highly stratified population.