A simulated annealing-based algorithm for selecting balanced samples

Balanced sampling is a random method for sample selection whose use is preferable when auxiliary information is available for all units of a population. However, implementing balanced sampling can be challenging, due in part to the computational effort required and to the necessity of respecting both the balancing constraints and the inclusion probabilities. In the present paper, a new algorithm for selecting balanced samples is proposed. The method is inspired by simulated annealing, as the selection of a balanced sample can be interpreted as an optimization problem. A set of simulation experiments and an example using real data show the efficiency and accuracy of the proposed algorithm.


Introduction
Balanced sampling refers to a class of techniques aimed at randomly selecting units from a given population. The selection of balanced samples was first introduced by Gini (1928) and later recalled by Yates (1946) and Thionnet (1953), stimulating from then on an increasing interest in addressing, by means of balancing, several practical survey sampling problems (see for example Falorsi and Righi 2008; Grafström and Tillé 2013; Brus 2015; Benedetti and Piersimoni 2017; Chauvet 2017; Marazzi and Tillé 2017; Tillé et al. 2018; Chauvet and Le Gleut 2019; Kermorvant et al. 2019). The peculiarity of balanced sampling consists of exploiting a priori auxiliary information available for all units of a population at the design stage, so that the expansion estimator (Narain 1951; Horvitz and Thompson 1952) returns the balancing variable totals known at the population level. The stronger the correlation between the variables of interest and the balancing variables, the higher the efficiency of a balanced sampling design in estimating the population total.
Several contributions have been proposed in the literature to select balanced samples (for a review see Valliant et al. 2000). All of these methods are configured as partial solutions for a variety of reasons, chief among which are a considerable computing time when several balancing variables are used (e.g. in enumerative sampling; Ardilly 1991) and problems in respecting the original inclusion probabilities (e.g. in rejective sampling; Hájek 1981). A general solution for balanced sampling was finally proposed by Deville and Tillé (2004), whose cube method allows for the selection of balanced samples with equal or unequal inclusion probabilities and any number of auxiliary variables with fast execution time (Chauvet and Tillé 2006; Grafström and Lisic 2016). This method is based on a random transformation of the inclusion probabilities vector to draw a sample that exactly, or at least approximately, satisfies the original inclusion probabilities and balancing equations (Tillé 2011). Deville and Tillé (2004) provided a solution that allows for the selection of approximately balanced samples while respecting the inclusion probabilities (for an exhaustive presentation of the cube method, see Deville and Tillé 2004; Tillé 2006).
The selection of balanced samples may be viewed as a constrained optimization problem with discrete variables, and a solution may be borrowed from physics. In particular, simulated annealing (Kirkpatrick et al. 1983) is a well-known Markov chain Monte Carlo method aimed at solving complex combinatorial problems, such as the traveling salesman problem. This method uses the Metropolis-Hastings algorithm to approximate the global optimum of a given function, and it is a preferable alternative when approximating the global optimum is more important than finding a precise local one. For a review of the applications of simulated annealing, see e.g. Aarts and van Laarhoven (1987). In the present paper, we show that a deterministic version of simulated annealing may be applied in a sampling context to select high-quality, fixed-size balanced samples. A very fast algorithm called Simulated Annealing for Balanced Sampling (sabs) is presented, and its quality in terms of sample balance and respect of the inclusion probabilities is evaluated by means of Monte Carlo simulations. A comparison with the cube method is also carried out. The paper is structured as follows. In Sect. 2, preliminaries and notation are given. In Sect. 3, the new algorithm is presented and both technical and theoretical aspects are detailed. In Sect. 4, results of a simulation study exploring several scenarios are presented. Section 5 is devoted to the practical application of the algorithm to a real data-set. Finally, Sect. 6 discusses the results and concludes.

Preliminaries and notation
Let U be a finite population of N units indexed by k, with k ∈ {1, ..., N}. Denote by y the variable of interest and by y_k, k ∈ U, the value of y for the k-th population unit. The aim is to estimate the population total Y = Σ_{k∈U} y_k. Let S be a random fixed-size sample defined as a subset of U selected under a probability distribution p(·) and according to a without-replacement sampling design, such that Pr(S = s) = p(s), with Σ_{s⊂U} p(s) = 1. A random sample may also be defined as a discrete random non-negative vector a = (a_1, ..., a_k, ..., a_N)^T, where a_k indicates the number of selections of unit k ∈ U in the sample. The variable a_k is an inclusion indicator which, in without-replacement sampling, is equal to 1 if unit k is selected in the sample and 0 otherwise (i.e. it follows a Bernoulli distribution). The inclusion probability π_k is the probability that unit k is selected in the sample. This probability is derived from the chosen sampling design as π_k = Pr(k ∈ S) = E_p(a_k) = Σ_{s⊂U: s∋k} p(s), for each k ∈ U. In a design-based perspective, the population total Y can be estimated by the expansion estimator (Narain 1951; Horvitz and Thompson 1952) Ŷ = Σ_{k∈S} y_k/π_k. Let z_k = (z_{k1}, ..., z_{kj}, ..., z_{kJ})^T be a vector of J auxiliary variables known for all the units in the population U, so that the knowledge of z_k allows for the computation of the J totals Z_j = Σ_{k∈U} z_{kj}, j = 1, ..., J. When a sample is selected, the expansion estimator of the j-th auxiliary total can be computed as Ẑ_j = Σ_{k∈S} z_{kj}/π_k. A sampling design p(s) is said to be balanced on the vector of auxiliary variables z_k if and only if it satisfies the balancing equations Σ_{k∈S} z_k/π_k = Σ_{k∈U} z_k.
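As a minimal numerical illustration of the expansion estimator and the balancing equations above, the following sketch uses a toy population whose values (y, z, the sample drawn) are invented for the example:

```python
# Toy population: N = 8 units with one study variable y and J = 2
# auxiliary variables z; all values are made up for illustration.
y = [3.0, 1.5, 4.0, 2.5, 5.0, 1.0, 3.5, 2.0]
z = [[1.0, 2.0], [1.0, 1.0], [2.0, 3.0], [1.0, 2.0],
     [3.0, 4.0], [1.0, 1.0], [2.0, 2.0], [1.0, 2.0]]
N, J = len(y), 2
n = 4
pik = [n / N] * N                 # equal inclusion probabilities pi_k = n/N

def expansion_total(values, sample, pik):
    """Expansion (Narain-Horvitz-Thompson) estimator of a total."""
    return sum(values[k] / pik[k] for k in sample)

Z = [sum(z[k][j] for k in range(N)) for j in range(J)]   # known totals

sample = [0, 2, 4, 6]             # one fixed-size sample of n = 4 units
Y_hat = expansion_total(y, sample, pik)
Z_hat = [expansion_total([z[k][j] for k in range(N)], sample, pik)
         for j in range(J)]
# The sample is balanced on z_j exactly when Z_hat[j] equals Z[j];
# here the chosen sample is not balanced, which motivates searching
# for configurations that satisfy the balancing equations.
```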
The selection of a balanced sample may be a challenging task because the sample membership indicators are integer-valued, while the balancing constraints are generally not. Indeed, there may exist a rounding problem that prevents the exact satisfaction of the balancing constraints. Therefore, the aim of this study is to find a design that exactly, or at least approximately, satisfies the balancing equations. The rounding problem becomes less relevant when the sample size is large, but this condition may not be compatible with the practical constraints of conducting a survey. In addition, the respect of the inclusion probabilities cannot be overlooked, as it is not always guaranteed by some balanced sampling methods.
Moreover, balanced sampling may be formulated and solved as a linear programming problem. In this respect, each sample s is assigned a cost C(s), which is equal to zero if the sample is perfectly balanced and takes greater values as the sample becomes increasingly unbalanced. The aim here is to find a sampling design p(s) that minimizes the mean cost Σ_{s⊂U} C(s)p(s), subject to the constraints Σ_{s⊂U} p(s) = 1 and Σ_{s⊂U: s∋k} p(s) = π_k, with k ∈ U, which ensure that the inclusion probabilities are respected. Deville and Tillé (1998) demonstrated that the solution to this problem leads to the selection of a minimal support design. Unfortunately, applying linear programming is, in most cases, unfeasible, as it requires enumerating all the possible samples, and their number, 2^N, is far too large. This reveals the need for a balanced sampling design that avoids such an enumeration. In this respect, the cube method proposed by Deville and Tillé (2004) allows one to consider the 2^N possible samples as 2^N vectors in R^N. This method provides a general solution to balance on several variables, with equal or unequal inclusion probabilities. The cube method is based on a random transformation of the inclusion probabilities vector, and a sample is obtained when the inclusion probabilities are exactly satisfied and the balancing equations are satisfied as well as possible (Deville and Tillé 2004; Chauvet and Tillé 2006).
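The enumeration argument above can be made concrete on a tiny population. The sketch below (with an invented auxiliary variable and a simple squared-relative-distance cost, one possible choice of C(s)) enumerates every fixed-size sample and finds a perfectly balanced one; for realistic N this enumeration is exactly what must be avoided:

```python
from itertools import combinations

# Tiny illustration (N = 6, n = 3): enumerate every fixed-size sample,
# assign each a cost measuring its imbalance, and inspect the minimum.
# The z-values are invented so that a perfectly balanced sample exists.
z = [1.0, 2.0, 3.0, 4.0, 5.0, 9.0]
N, n = len(z), 3
pik = n / N                                  # equal probabilities n/N
Z = sum(z)                                   # known total = 24

def cost(s):
    """C(s): squared relative distance between Z_hat and Z."""
    Z_hat = sum(z[k] / pik for k in s)
    return ((Z_hat - Z) / Z) ** 2

all_samples = list(combinations(range(N), n))   # only 20 samples here
best = min(all_samples, key=cost)
# cost(best) == 0: e.g. the sample with z-values {3, 4, 5} satisfies
# the balancing equation, since 3 + 4 + 5 = 12 = Z * pik.
```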

Simulated annealing-based algorithm
In balanced sampling, the aim is to select a random sample of fixed size according to a multivariate distribution with conditions on its support. Simulated annealing may represent a suitable solution, since it can be used to generate samples from any high-dimensional distribution if the probability function is known. Exploiting the idea at the basis of the method, a combinatorial optimization problem can be viewed as a stochastic process (Robert and Casella 2013) in which local conditional distributions depend on a global control parameter, i.e. the so-called temperature (for a readable explanation, see Geman and Geman 1984). Simulated annealing requires the generation of a finite sequence of decreasing values of the temperature (the annealing schedule) according to a probabilistic acceptance rule (i.e. the Metropolis-Hastings algorithm), in order to converge toward a set of globally optimal solutions (i.e. the minimum energy states). The parallelism with sampling problems easily follows. Specifically, let us define an energy function f(a) = (1/J) Σ_{j=1}^{J} ((Ẑ_j − Z_j)/Z_j)². This represents a mean quadratic distance between the estimated totals and the known totals of the balancing variables, for any possible configuration of the sample a ∈ {0, 1}^N. Note that any distance function can be used as an energy function and no restrictions are placed on it. Clearly, if f(a) = 0, the balance of the sample is exactly satisfied. The proposed sampling algorithm works toward achieving this condition by searching for an optimal configuration. Hence, from the configuration a^(i) obtained at the i-th attempt of the algorithm, another configuration a^(e) may be obtained by randomly exchanging the inclusion indicator labels of two units at each attempt, thus preserving the sample size. The second configuration is preferred to the first if f(a^(e)) < f(a^(i)), using a deterministic approach, following the idea of Besag (1986) in the image processing context.
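The energy function and the exchange move just described can be sketched as follows. This is an illustrative reading, not the authors' implementation: the data are randomly generated, and the mean squared relative distance is one admissible choice of energy, since the text places no restriction on the distance used:

```python
import random

# Energy f(a) and one exchange move on a toy population.
random.seed(1)
N, n, J = 10, 4, 2
z = [[random.uniform(1, 5) for _ in range(J)] for _ in range(N)]
pik = [n / N] * N
Z = [sum(z[k][j] for k in range(N)) for j in range(J)]   # known totals

def energy(a):
    """f(a): mean squared relative distance of the Z_hat from the Z."""
    out = 0.0
    for j in range(J):
        Z_hat = sum(z[k][j] / pik[k] for k in range(N) if a[k] == 1)
        out += ((Z_hat - Z[j]) / Z[j]) ** 2
    return out / J

def exchange(a):
    """Swap the inclusion labels of one unit in and one unit out,
    so the fixed sample size is preserved."""
    b = list(a)
    k_in = random.choice([k for k in range(N) if b[k] == 1])
    k_out = random.choice([k for k in range(N) if b[k] == 0])
    b[k_in], b[k_out] = 0, 1
    return b

a = [1] * n + [0] * (N - n)       # some current configuration a^(i)
b = exchange(a)                   # candidate configuration a^(e)
# b is accepted in place of a only if energy(b) < energy(a).
```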
The maximum number of iterations is equal to MAXITER, where each iteration consists of N attempts, with MAXITER chosen by the user and N the population size. Therefore, the algorithm performs at most MAXITER · N attempts. The procedure stops earlier if a pre-fixed minimal level of respect of the balancing constraints is reached. Thus, a final configuration is always obtained, and no restrictions on the summary index used in the convergence (CONV) check are needed.
Algorithm 1 sabs: simulated annealing-based algorithm for selecting balanced samples.
Start with the initial configuration a^(1), given by a simple random sample of size n, and set d = ∞. The last accepted configuration represents the selected sample.
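The printed steps of Algorithm 1 are only partially legible here, so the following is a Python sketch of the procedure as described in the surrounding text, not a verbatim port of the authors' C++/Rcpp implementation; the data, the relative energy function, and the parameter values are assumptions for the example:

```python
import random

random.seed(42)
N, n, J = 50, 10, 3
MAXITER, CONV = 10, 1e-6          # user-chosen limits, as in the text
z = [[random.uniform(1, 10) for _ in range(J)] for _ in range(N)]
pik = [n / N] * N                 # equal inclusion probabilities
Z = [sum(z[k][j] for k in range(N)) for j in range(J)]

def energy(a):
    """Mean squared relative distance between estimated and known totals."""
    out = 0.0
    for j in range(J):
        Z_hat = sum(z[k][j] / pik[k] for k in range(N) if a[k])
        out += ((Z_hat - Z[j]) / Z[j]) ** 2
    return out / J

def sabs():
    # Step 1: initial configuration a^(1) = simple random sample of size n.
    a = [0] * N
    for k in random.sample(range(N), n):
        a[k] = 1
    d = energy(a)
    # Step 2: at most MAXITER iterations of N exchange attempts each.
    for _ in range(MAXITER):
        for _ in range(N):
            b = list(a)
            k_in = random.choice([k for k in range(N) if b[k]])
            k_out = random.choice([k for k in range(N) if not b[k]])
            b[k_in], b[k_out] = 0, 1      # exchange preserves sample size
            d_new = energy(b)
            if d_new < d:                 # deterministic acceptance rule
                a, d = b, d_new
            if d < CONV:                  # early stop on convergence check
                return a, d
    return a, d

sample, final_energy = sabs()
```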
This procedure implies that a local minimum, not necessarily a global one, is reached. The way to avoid entrapment in local minima lies in the use of a probabilistic acceptance rule with slowly decreasing temperature, as in traditional implementations of simulated annealing. Unfortunately, its main drawback is the strong computational effort required, because the temperature needs to be decreased very slowly so that all possible solutions may be visited. This makes such a procedure difficult to use in practice, especially with a high number of balancing variables and with large populations. In addition, the search for global optima may produce undesirable solutions in terms of the selected samples. However, according to our computational experience, the deterministic decreasing rule on which this algorithm is based ensures that the balancing constraints can be satisfactorily achieved in a reasonable number of iterations, while the obtained first-order inclusion probabilities are very close to the desired ones, as shown in the following section.
From a practical point of view, the sabs selects a set of fixed-size random samples without replacement, according to the initial vector of inclusion probabilities. This means that for each configuration, it is possible to resort to the expansion estimator and to exploit its properties in total estimation, variance, and variance estimation (Horvitz and Thompson 1952).

Simulation experiments
In order to check the properties of the proposed sampling algorithm, several simulation experiments were carried out. Twelve populations were generated according to different distributions and sizes: a Uniform U(0, 1), a Normal N(0, 1) + 6, an Exponential Exp(0.5), and a Bimodal distribution defined as the mixture of normals Bim = 0.4N(2, 0.25) + 0.6N(4, 0.25), each with population sizes equal to 1000, 5000 and 10000 units. These populations are referred to as U_1-U_12, and a summary of their characteristics is reported in Table 1.
For each population, J = 3; 5; 10 independent auxiliary variables were generated, for a total of 36 scenarios. We considered a relevant number of auxiliary variables in order to evaluate the performance of the proposed method in the presence of a large amount of auxiliary information.
For each of the above-mentioned scenarios, M = 10000 fixed-size samples were selected with equal inclusion probabilities and with sampling fractions f = 0.01; 0.05; 0.1, by means of the sabs and cube methods. The proposed algorithm was compared with the cube method since the latter is the most widely used design for the selection of balanced samples, and therefore the most appropriate competitor. Moreover, in order to investigate some extreme cases as well, four small populations (U_13-U_16) of size N = 100 were generated according to the aforementioned distributions, with J = 3; 5; 10 independent auxiliary variables.
Here also, M = 10000 fixed-size samples were selected, this time with sampling fractions f = 0.1; 0.25. Note that the sabs algorithm was set to stop when a minimal respect of the balancing constraints equal to 0.1% was reached, with a maximum number of iterations equal to 10 × N.
The simulation focuses on two very important aspects in the evaluation of a balanced sampling design: the respect of the first-order inclusion probabilities and the respect of the balancing constraints. The former is investigated by means of the relative Root Mean Squared Error, defined as rRMSE_{π_k} = √[(1/N) Σ_{k∈U} ((v_k/M − π_k)/π_k)²], where v_k is the number of times the k-th unit is selected in the M Monte Carlo replicates. The difference in the respect of the balancing constraints is measured as CD = (1/M) Σ_{m=1}^{M} (1/J) Σ_{j=1}^{J} |Ẑ_j^(m) − Z_j|/Z_j, where Ẑ_j^(m) is the estimate of Z_j in the m-th replicate. Moreover, a comparison in terms of computational time was performed. In particular, as an example, we report the elapsed time needed to select one sample from the populations generated by U(0, 1).
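The inclusion-probability diagnostic can be reproduced in a few lines. The sketch below is a hedged illustration: it uses plain srswor as the design (so the probabilities are respected by construction), whereas the simulations above plug sabs and the cube method into the same Monte Carlo loop; all sizes are invented:

```python
import random

# Monte Carlo check of first-order inclusion probabilities: select M
# replicates, count selections v_k of each unit, compare v_k/M to pi_k.
random.seed(7)
N, n, M = 20, 5, 2000
pik = [n / N] * N                 # design probabilities (equal here)
v = [0] * N                       # selection counts over the replicates
for _ in range(M):
    for k in random.sample(range(N), n):   # one srswor replicate
        v[k] += 1

# Relative root mean squared error of the empirical probabilities v_k/M.
rrmse_pik = (sum(((v[k] / M - pik[k]) / pik[k]) ** 2
                 for k in range(N)) / N) ** 0.5
# A small rrmse_pik indicates that the design respects the pi_k well.
```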

Respect of the inclusion probabilities and of the balancing constraints
The most relevant results of the Monte Carlo experiments on U_1-U_4 and U_9-U_12 are reported in Tables 2 and 3. With respect to the inclusion probabilities, the cube method showed better performance than sabs, although the difference between the two methods decreased as the sampling fraction increased. Note that the largest sampling fraction considered was equal to 10% of the population size. This behaviour was detected in all of the considered populations.
With respect to the balancing constraints, the performance of sabs was very encouraging. The method showed errors very close to zero, for all three population sizes and for the smaller sampling fractions considered. The difference in performance between the sabs and cube methods increased with the population size (see the results reported in the Supplementary Material).
The considerations made up to this point hold for all of the distributions used to generate the populations presented so far.
With regard to the populations of 100 units, some simulation results are reported in Table 4. These populations represent a very particular case due to the very small population sizes, which in turn imply very small sample sizes. Nevertheless, the behaviour of the sabs method is confirmed. The cube method remains the best at respecting the inclusion probabilities, even though the two methods become very similar as the sampling fraction grows. It should be noted that, with the bimodal distribution, the rRMSE_{π_k} is practically identical for both methods. In terms of respect of the balancing constraints, sabs was again confirmed as the best method for every distribution and every sampling fraction considered.

Time of computation
The computing time needed to select a sample under a given sampling method can be an important factor in both surveys and simulations, especially if it depends strongly on N and n. Table 5 reports the average CPU time in seconds taken by each of the algorithms to select a sample for different populations and sample sizes. Simulations were performed on a 2.3 GHz Dual-Core Intel Core i5. Samples were selected using the R software. In particular, the sabs algorithm was implemented in C++ via Rcpp; the cube method was implemented by means of the BalancedSampling package (Grafström and Lisic 2016), which is a fast implementation in C++ via Rcpp, and by means of the sampling package (Tillé and Matei 2009). The sabs algorithm was the least computationally intensive method, being appreciably quicker than both implementations of the cube method. Its computing time increased gradually with n and only proportionally to N, mainly because the number of attempts per iteration was set equal to N. Thus, we can be confident that the sabs algorithm can be effectively applied to large populations without particular difficulties, and without any storage or RAM limitations other than those imposed by the size of the frame from which the sample is to be selected.

An experiment on real data
Real data were used in order to investigate the efficiency of the sampling design when estimating a target variable in a real context. The dataset we used is freely available from www.statbel.fgov.be and concerns fiscal statistics on income subject to personal income tax. Data refer to the 581 Belgian municipalities for the period 2005-2018. Focusing on the last available year, we aim to estimate the total Belgian net income (Y) by exploiting the following three auxiliary variables: the number of declarations with a total net taxable income greater than zero (z_1), the number of declarations where the amount of total taxes is greater than zero (z_2), and the number of residents per municipality (z_3). We selected M = 10000 fixed-size samples with equal inclusion probabilities, for sampling fractions f = 0.01; 0.05; 0.1, by means of the sabs algorithm, the cube method, and simple random sampling without replacement (srswor). The latter is a traditional benchmark in survey sampling. Besides computing rRMSE_{π_k} and CD, the population total of Y was estimated by the expansion estimator Ŷ. We computed the relative Root Mean Squared Error of the estimator, defined as rRMSE_Ŷ = √[(1/M) Σ_{m=1}^{M} (Ŷ^(m) − Y)²]/Y, where Ŷ^(m) is the estimate obtained in the m-th replicate. Table 6 shows the results of rRMSE_{π_k}, CD, and rRMSE_Ŷ, obtained by selecting samples with the three sampling methods considered.
The results of rRMSE_{π_k} and CD were in line with those seen in the previous simulations. Indeed, the sabs algorithm demonstrated greater efficiency in respecting the balancing constraints, while performing worse than srswor and the cube method in the respect of the inclusion probabilities. Clearly, srswor respects the inclusion probabilities considerably better than the balanced sampling methods, but this advantage is outweighed by its markedly lower efficiency in producing estimates of the target variable.
In fact, the ability of the sabs and cube methods to efficiently estimate the Y variable is evident, especially for the former. Hence, the efficiency of the proposed algorithm has been shown once again, together with its usability in practical contexts.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.