1 Introduction

Balanced sampling refers to a class of techniques for randomly selecting units from a given population. The selection of balanced samples was first introduced by Gini (1928) and later revisited by Yates (1946) and Thionnet (1953), stimulating from then on an increasing interest in addressing practical survey sampling problems by means of balancing (see for example Falorsi and Righi 2008; Grafström and Tillé 2013; Brus 2015; Benedetti and Piersimoni 2017; Chauvet 2017; Marazzi and Tillé 2017; Tillé et al. 2018; Chauvet and Le Gleut 2019; Kermorvant et al. 2019). The peculiarity of balanced sampling is that it exploits auxiliary information, available a priori for all units of a population at the design stage, so that the expansion estimator (Narain 1951; Horvitz and Thompson 1952) returns the balancing variable totals known at the population level. The stronger the correlation between the variables of interest and the balancing variables, the higher the efficiency of a balanced sampling design in estimating the population total.

Several contributions have been proposed in the literature to select balanced samples (for a review see Valliant et al. 2000). All these methods are only partial solutions for a variety of reasons, chiefly a considerable computing time when several balancing variables are used (e.g. in enumerative sampling; Ardilly 1991) and a failure to respect the original inclusion probabilities (e.g. in rejective sampling; Hájek 1981). A general solution for balanced sampling was finally proposed by Deville and Tillé (2004), whose cube method allows for the selection of balanced samples with equal or unequal inclusion probabilities and any number of auxiliary variables, with fast execution time (Chauvet and Tillé 2006; Grafström and Lisic 2016). This method is based on a random transformation of the inclusion probabilities vector to draw a sample that exactly, or at least approximately, satisfies the balancing equations while respecting the original inclusion probabilities (Tillé 2011). In other words, Deville and Tillé (2004) provided a solution that allows for the selection of approximately balanced samples while exactly respecting the inclusion probabilities (for an exhaustive presentation of the cube method, see Deville and Tillé 2004; Tillé 2006).

The selection of balanced samples may be viewed as a constrained optimization problem with discrete variables, and a solution may be borrowed from physics. In particular, a well-known Markov chain Monte Carlo method aimed at solving complex combinatorial problems, such as the traveling salesman problem, is simulated annealing (Kirkpatrick et al. 1983). This method uses the Metropolis-Hastings algorithm for approximating the global optimum of a given function and is a preferable alternative when approximating a global optimum is more important than finding a precise local one. For a review of the applications of simulated annealing, see e.g. Aarts and van Laarhoven (1987). In the present paper, we show that a deterministic version of simulated annealing may be applied in the sampling context to select high quality, fixed-size balanced samples. A very fast algorithm called Simulated Annealing for Balanced Sampling (sabs) is presented, and its quality in terms of sample balance and respect of inclusion probabilities is evaluated by means of Monte Carlo simulations. A comparison with the cube method is also carried out. The paper is structured as follows. In Sect. 2, preliminaries and notations are given. In Sect. 3, the new algorithm is presented and both technical and theoretical aspects are detailed. In Sect. 4, results of a simulation study exploring several scenarios are presented. Section 5 is devoted to the practical application of the algorithm to a real data-set. Finally, Sect. 6 discusses the results and concludes.

2 Preliminaries and notation

Let U be a finite population of N units, indexed by \(k\in \left\{ 1,...,N\right\} \). Denote by y the variable of interest and by \(y_{k},\) \(k\in U\), the value of y for the k-th population unit. The aim is to estimate the population total \(Y=\sum _{k\in U}y_{k}\). Let S be a random fixed-size sample, defined as a subset of U selected without replacement under a probability distribution \(p(\cdot )\), such that \(Pr(S=s)=p(s)\) for each \(s\subset U\), with \(\sum _{s\subset U}p(s)=1\). A random sample may also be defined as a discrete random non-negative vector \(\mathbf {a}=\left( a_{1},...,a_{k},...,a_{N}\right) ^{\text {T}}\), where \(a_{k}\) indicates the number of selections of unit \(k\in U\) in the sample. The variable \(a_{k}\) is an inclusion indicator which, in without-replacement sampling, is equal to 1 if unit k is selected in the sample and 0 otherwise (i.e. it follows a Bernoulli distribution). The inclusion probability \(\pi _{k}\) is the probability that unit k is selected in the sample. This probability is derived from the chosen sampling design as \(\pi _{k}=Pr(k\in S)=E_{p}(a_{k})=\sum _{s\subset U,s\ni k}p(s)\), for each \(k\in U\). In a design-based perspective, the population total Y can be estimated by the expansion estimator (Narain 1951; Horvitz and Thompson 1952) \(\widehat{Y}=\sum _{k\in S}\frac{y_{k}}{\pi _{k}}=\sum _{k\in U}\frac{y_{k}a_{k}}{\pi _{k}}\), with \(\pi _{k}>0\).

Let \(\mathbf {z}_{k}=(z_{k1},...,z_{kj},...,z_{kJ})^{\text {T}}\) be a vector of J auxiliary variables known for all the units in the population U, so that the knowledge of \(\mathbf {z}_{k}\) allows for the computation of the J totals \(Z_{j}=\sum _{k\in U}z_{kj}\), \(j=1,...,J\). When a sample is selected, the expansion estimator of the j-th auxiliary total can be computed as \(\hat{Z}_{j}=\sum _{k\in S}\frac{z_{kj}}{\pi _{k}}\). A sampling design p(s) is said to be balanced on the vector of auxiliary variables \(\mathbf {z}_{k}\) if and only if it satisfies the balancing equations

$$\begin{aligned} \sum _{k\in S}\frac{\mathbf {z}_{k}}{\pi _{k}}=\sum _{k\in U}\mathbf {z}_{k}. \end{aligned}$$
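To make the balancing condition concrete, the following is a minimal Python sketch (the population, variable names, and equal inclusion probabilities are our illustrative assumptions) that draws one simple random sample and measures how far its expansion estimates of the auxiliary totals fall from the known totals:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical population of N = 1000 units with J = 2 auxiliary variables.
N, n = 1000, 100
z = rng.uniform(0, 1, size=(N, 2))   # auxiliary variables z_k
pi = np.full(N, n / N)               # equal inclusion probabilities pi_k = n/N

# Known population totals Z_j.
Z = z.sum(axis=0)

# Draw one simple random sample (not balanced) and compute the
# expansion (Horvitz-Thompson) estimates of the auxiliary totals.
sample = rng.choice(N, size=n, replace=False)
Z_hat = (z[sample] / pi[sample][:, None]).sum(axis=0)

# A design is balanced when Z_hat equals Z (up to the rounding problem);
# a plain random sample will generally show a non-zero relative imbalance.
imbalance = np.abs(Z_hat - Z) / Z
print(imbalance)
```

A balanced sampling method seeks configurations for which this imbalance vector is zero, or as close to zero as the rounding problem allows.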

The selection of a balanced sample may be a challenging task because an integer solution (the sample) cannot, in general, satisfy non-integer constraints exactly. Indeed, a rounding problem may prevent the exact satisfaction of the balancing constraints. Therefore, the aim is to find a design that exactly, or at least approximately, satisfies the balancing equations. The rounding problem becomes less relevant when the sample size is large, but this condition may not be compatible with the practical constraints of conducting a survey. In addition, the respect of the inclusion probabilities cannot be overlooked, as some balanced sampling methods do not always satisfy them.

Moreover, balanced sampling may be formulated and solved as a linear programming problem. In this respect, each sample is assigned a cost C(s), which is equal to zero if the sample is perfectly balanced and grows as the sample becomes increasingly unbalanced. The aim is then to find a sampling design p(s) that minimizes the mean cost \(\sum _{s\subset U}C(s)p(s)\), subject to the constraints \(\sum _{s\subset U}p(s)=1\) and \(\sum _{s\subset U,s\ni k}p(s)=\pi _{k},\) for \(k\in U\), which ensure that the inclusion probabilities are respected.

Deville and Tillé (1998) demonstrated that the solution to this problem leads to the selection of a minimal support design. Unfortunately, applying linear programming is, in most cases, unfeasible, as it requires the enumeration of all possible samples, and their number, \(2^{N}\), is far too large. This reveals the need for a balanced sampling design that avoids such enumeration. In this respect, the cube method proposed by Deville and Tillé (2004) considers the \(2^{N}\) possible samples as \(2^{N}\) vertices of a cube in \(\mathbb {R}^{N}\). This method provides a general solution to balance on several variables, with equal and unequal inclusion probabilities. The cube method is based on a random transformation of the inclusion probabilities vector, and a sample is obtained when the inclusion probabilities are exactly satisfied and the balancing equations are satisfied as well as possible (Deville and Tillé 2004; Chauvet and Tillé 2006).

3 Simulated annealing-based algorithm

In balanced sampling, the aim is to select a random sample of fixed size according to a multivariate distribution with conditions on the support. Simulated annealing may represent a suitable solution, since it can be used to generate samples from any high-dimensional distribution if the probability function is known. Exploiting the idea at the basis of the method, a combinatorial optimization problem can be viewed as a stochastic process (Robert and Casella 2013), in which local conditional distributions depend on a global control parameter, the so-called temperature (for a readable explanation, see Geman and Geman 1984). Simulated annealing requires the generation of a finite sequence of decreasing values of the temperature (the annealing schedule) according to a probabilistic acceptance rule (i.e. the Metropolis-Hastings algorithm), in order to converge toward a set of globally optimal solutions (i.e. the minimum energy states). The parallel with sampling problems easily follows. Specifically, let us define an energy function

$$\begin{aligned} f\left( \mathbf {a}\right) =\sqrt{\frac{\sum _{j=1}^{J}\left( \hat{Z}_{j}-Z_{j}\right) ^{2}}{J}}. \end{aligned}$$

This represents a mean quadratic distance between the estimated totals and the known totals of the balancing variables, for any possible configuration of the sample \(\mathbf {a}\in \left\{ 0,1\right\} ^{N}\). Note that any distance function can be used as an energy function, with no restrictions. Clearly, if \(f\left( \mathbf {a}\right) =0\), the balance of the sample is exactly satisfied. The proposed sampling algorithm works toward this condition by searching for an optimal configuration. Hence, from the configuration \(\mathbf {a}^{(i)}\) obtained at the i-th attempt of the algorithm, another configuration \(\mathbf {a}^{(e)}\) is obtained by randomly exchanging the inclusion indicator labels of two units at each iteration, which preserves the sample size. The second configuration is preferred to the first if \(f\left( \mathbf {a}^{(e)}\right) <f\left( \mathbf {a}^{(i)}\right) \); this deterministic acceptance rule follows the idea of Besag (1986) in the image processing context. The maximum number of iterations is MAXITER, chosen by the user, where each iteration consists of N attempts and N is the population size; the algorithm therefore performs at most \(MAXITER\cdot N\) attempts. The procedure stops early if a pre-fixed minimal level of respect of the balancing constraints is reached. Thus, a final configuration is always obtained, and no restrictions are needed on the summary index used in the convergence (CONV) check.

figure a
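The deterministic annealing loop described above can be sketched as follows. This is a minimal Python illustration, not the authors' Rcpp implementation: function names and parameter defaults are ours, and the initial fixed-size configuration is drawn assuming equal inclusion probabilities.

```python
import numpy as np

def energy(a, z, pi, Z):
    """Root mean squared distance between the HT estimates of the
    balancing totals and the known totals Z, for configuration a."""
    Z_hat = (z[a == 1] / pi[a == 1][:, None]).sum(axis=0)
    return np.sqrt(np.mean((Z_hat - Z) ** 2))

def sabs_sketch(z, pi, n, maxiter=10, tol=1e-3, seed=None):
    """Deterministic annealing sketch: randomly exchange an in-sample
    unit with an out-of-sample unit, keep the swap only if it lowers
    the energy; stop after MAXITER*N attempts or when f(a) <= tol."""
    rng = np.random.default_rng(seed)
    N = len(pi)
    Z = z.sum(axis=0)
    # Initial fixed-size configuration (equal-probability case: a
    # simple random sample of size n).
    a = np.zeros(N, dtype=int)
    a[rng.choice(N, size=n, replace=False)] = 1
    f = energy(a, z, pi, Z)
    for _ in range(maxiter * N):
        if f <= tol:
            break
        i = rng.choice(np.flatnonzero(a == 1))   # unit leaving the sample
        e = rng.choice(np.flatnonzero(a == 0))   # unit entering the sample
        a[i], a[e] = 0, 1                        # exchange the labels
        f_new = energy(a, z, pi, Z)
        if f_new < f:
            f = f_new                            # accept: energy decreased
        else:
            a[i], a[e] = 1, 0                    # reject: undo the swap
    return a, f
```

Because each attempt exchanges exactly one in-sample unit with one out-of-sample unit, the sample size is preserved throughout, and the energy sequence is non-increasing by construction.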

This procedure implies that a local minimum, not necessarily a global one, is reached. The way to avoid entrapment in local minima lies in the use of a probabilistic decreasing rule, as in traditional implementations of simulated annealing. Unfortunately, the main drawback is the strong computational effort required, because the temperature needs to be decreased very slowly so that all possible solutions may be visited. This makes the traditional procedure difficult to use in practice, especially with a high number of balancing variables and with large populations. In addition, the search for global optima may produce undesirable solutions in terms of the selected samples. However, according to our computational experience, the deterministic decreasing rule on which our algorithm is based satisfies the balancing constraints satisfactorily in a reasonable number of iterations and yields first-order inclusion probabilities very close to the desired ones, as shown in the following section.

From a practical point of view, the sabs algorithm selects fixed-size random samples without replacement, according to the initial vector of inclusion probabilities. This means that, for each configuration, it is possible to resort to the expansion estimator and to exploit its properties for the estimation of totals, their variance, and variance estimation (Horvitz and Thompson 1952).

4 Simulation experiments

In order to check the properties of the proposed sampling algorithm, several simulation experiments were carried out. Twelve populations have been generated according to different distributions and sizes. Indeed, a Uniform \(\mathcal {U}\left( 0,1\right) \), a Normal \(\mathcal {N}\left( 0,1\right) +6\), an Exponential \(Exp\left( 0.5\right) \), and a Bimodal distribution defined as a mixture of normals \(Bim=0.4\mathcal {N}(2,0.25)+0.6\mathcal {N}(4,0.25)\), each with population sizes equal to 1000, 5000 and 10000 units, have been considered. These populations are referred to as \(U_{1}\) - \(U_{12}\), and a summary of their characteristics is reported in Table 1.
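The twelve populations can be generated, for instance, as in the following Python sketch. Two parameterization details are our assumptions, since the text does not spell them out: 0.5 is taken as the rate of the Exponential, and 0.25 as the variance of the normal mixture components.

```python
import numpy as np

rng = np.random.default_rng(2023)

def generate_population(dist, N):
    """One study variable per the four distributions in the text."""
    if dist == "uniform":
        return rng.uniform(0, 1, N)            # U(0, 1)
    if dist == "normal":
        return rng.normal(0, 1, N) + 6         # N(0, 1) + 6
    if dist == "exponential":
        return rng.exponential(1 / 0.5, N)     # Exp(rate = 0.5), assumed
    if dist == "bimodal":
        # Mixture 0.4*N(2, 0.25) + 0.6*N(4, 0.25); 0.25 read as variance.
        comp = rng.random(N) < 0.4
        return np.where(comp,
                        rng.normal(2, np.sqrt(0.25), N),
                        rng.normal(4, np.sqrt(0.25), N))
    raise ValueError(dist)

# The 4 distributions x 3 sizes = 12 populations U_1 - U_12.
populations = {(d, N): generate_population(d, N)
               for d in ("uniform", "normal", "exponential", "bimodal")
               for N in (1000, 5000, 10000)}
```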

Table 1 Generated populations according to Uniform, Normal, Exponential, and Bimodal distributions, summarized for the population sizes

For each population, \(J=3;5;10\) independent auxiliary variables have been generated, for a total of 36 scenarios. We considered a substantial number of auxiliary variables in order to evaluate the performance of the proposed method in the presence of a rich set of auxiliary information.

For each of the above-mentioned scenarios, \(M=10000\) fixed-size samples have been selected with equal inclusion probabilities and with sampling fractions \(\mathrm {f}=0.01;0.05;0.1\), by means of the sabs and cube methods. Indeed, the proposed algorithm has been compared with the cube method since it is the most widely used sampling design for the selection of balanced samples and is therefore the most appropriate competitor. Moreover, in order to investigate some extreme cases as well, four small populations (\(U_{13}\) - \(U_{16}\)) of size \(N=100\) have been generated according to the aforementioned distributions, with \(J=3;5;10\) independent auxiliary variables. Here also \(M=10000\) fixed-size samples have been selected, this time with sampling fractions \(\mathrm {f}=0.1;0.25\). Note that the sabs algorithm has been set to stop when a minimal balancing tolerance of 0.1% is reached, with a maximum number of iterations equal to \(10\times N\).

The simulation focuses on two very important aspects in the evaluation of a balanced sampling design: respect of first-order inclusion probabilities and respect of balancing constraints. The former is investigated by means of the relative Root Mean Squared Error, defined as:

$$\begin{aligned} rRMSE_{\pi _{k}}=\frac{\sqrt{\frac{\sum _{k=1}^{N}\left( \frac{v_{k}}{M}-\pi _{k}\right) ^{2}}{N}}}{\mathrm {f}}, \end{aligned}$$

where \(v_{k}\) is the number of times the k-th unit is selected over the M Monte Carlo replicates. The deviation from the balancing constraints has been defined as:

$$\begin{aligned} CD=mean(\mathbf {d}), \end{aligned}$$

with \(\mathbf {d}=\left\{ d_{1},...,d_{m},...,d_{M}\right\} \) and \(d_{m}=max_{j}\left( \frac{\hat{Z}_{j,m}-Z_{j}}{Z_{j}}\right) \).
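The two diagnostics above translate directly into code. The following Python sketch (function names are ours) implements \(rRMSE_{\pi _{k}}\) from the Monte Carlo selection counts and CD from the matrix of estimated auxiliary totals:

```python
import numpy as np

def rrmse_pi(v, pi, M, f):
    """Relative RMSE of the empirical inclusion probabilities v_k / M
    against the target pi_k, scaled by the sampling fraction f."""
    return np.sqrt(np.mean((np.asarray(v) / M - pi) ** 2)) / f

def cd(Z_hat, Z):
    """Mean over the M replicates of the worst relative deviation of
    the estimated auxiliary totals from the known totals.
    Z_hat: (M, J) array of HT estimates; Z: (J,) known totals."""
    d = np.max((np.asarray(Z_hat) - Z) / Z, axis=1)   # d_m, one per replicate
    return float(d.mean())
```

For a design that respects the inclusion probabilities exactly, \(v_{k}/M\) converges to \(\pi _{k}\) and rrmse_pi tends to zero; likewise, cd is zero only if every replicate is perfectly balanced.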

Moreover, a comparison in terms of computational time has been performed. In particular, as an example, we report the elapsed time to select one sample from the populations generated by \(\mathcal {U}\left( 0,1\right) \).

4.1 Respect of the inclusion probabilities and of the balancing constraints

The most relevant results of the Monte Carlo experiments on \(U_{1}\) - \(U_{4}\) and \(U_{9}\) - \(U_{12}\) are reported in Tables 2 and 3 and are graphically summarized in Figs. 1, 2, 3 and 4, while remaining simulation results are reported in the Supplementary Material.

Table 2 Results of the Monte Carlo experiment on populations \(U_{1}\), \(U_{2}\), \(U_{3}\), and \(U_{4}\), generated according to Uniform, Normal, Exponential, and Bimodal distributions with \(J=10\) for three sampling fractions \(\mathrm {f}=0.01;0.05;0.1\). Monte Carlo replicates \(M=10000\)
Table 3 Results of the Monte Carlo experiment on populations \(U_{9}\), \(U_{10}\), \(U_{11}\), and \(U_{12}\), generated according to Uniform, Normal, Exponential, and Bimodal distributions with \(J=10\) for three sampling fractions \(\mathrm {f}=0.01;0.05;0.1\). Monte Carlo replicates \(M=10000\)
Fig. 1

Results of the Monte Carlo experiment regarding the respect of the inclusion probabilities. Simulations were performed on populations \(U_{1}\), \(U_{2}\), \(U_{3}\), and \(U_{4}\), generated according to Uniform, Normal, Exponential, and Bimodal distributions with \(J=10\) for three sampling fractions \(\mathrm {f}=0.01;0.05;0.1\). Monte Carlo replicates \(M=10000.\) X-axis: rRMSE values; Y-axis: frequency (density) of selected samples

Fig. 2

Results of the Monte Carlo experiment regarding the respect of the balancing constraints. Simulations were performed on populations \(U_{1}\), \(U_{2}\), \(U_{3}\), and \(U_{4}\), generated according to Uniform, Normal, Exponential, and Bimodal distributions with \(J=10\) for three sampling fractions \(\mathrm {f}=0.01;0.05;0.1\). Monte Carlo replicates \(M=10000.\) X-axis: rRMSE values; Y-axis: frequency (density) of selected samples

Fig. 3

Results of the Monte Carlo experiment regarding the respect of the inclusion probabilities. Simulations were performed on the populations \(U_{9}\), \(U_{10}\), \(U_{11}\), and \(U_{12}\), generated according to Uniform, Normal, Exponential, and Bimodal distributions with \(J=10\) for three sampling fractions \(\mathrm {f}=0.01;0.05;0.1\). Monte Carlo replicates \(M=10000.\) X-axis: rRMSE values; Y-axis: frequency (density) of selected samples

Fig. 4

Results of the Monte Carlo experiment regarding the respect of the balancing constraints. Simulations were performed on the populations \(U_{9}\), \(U_{10}\), \(U_{11}\), and \(U_{12}\), generated according to Uniform, Normal, Exponential, and Bimodal distributions with \(J=10\) for three sampling fractions \(\mathrm {f}=0.01;0.05;0.1\). Monte Carlo replicates \(M=10000.\) X-axis: rRMSE values; Y-axis: frequency (density) of selected samples

The simulation results motivate the following comments.

With respect to the inclusion probabilities, the cube method showed better performance than the sabs method. Furthermore, the performance difference between the two methods decreased as the sampling fraction increased. Note that the largest sampling fraction considered was equal to 10% of the population size. This behaviour was detected on all of the considered populations.

With respect to the balancing constraints, the performance of the sabs algorithm was very encouraging. The method showed errors very close to zero, for all three population sizes and for the smaller sampling fractions considered. The difference in performance between the sabs and cube methods increased with population size (see results reported in the Supplementary Material).

The considerations made up to this point hold for all of the distributions used to generate the populations presented so far.

With regard to the populations of size equal to 100 units, some simulation results are reported in Table 4, while the extended set-up is reported in the Supplementary Material. Sampling fractions have been increased compared to the previous examples in order to allow balancing on the same number of constraints.

Table 4 Results of the Monte Carlo experiment on the populations \(U_{13}\), \(U_{14}\), \(U_{15}\), and \(U_{16}\), generated according to Uniform, Normal, Exponential, and Bimodal distributions with \(J=10\) for two sampling fractions \(\mathrm {f}=0.1;0.25\). Monte Carlo replicates \(M=10000\).

The latter populations represent a very particular case due to the very small population sizes, which in turn imply very small sample sizes. Nevertheless, the behaviour of the sabs method is confirmed. In terms of respect of the inclusion probabilities, the cube method remains the best, even though the two methods become very similar as the sampling fraction grows. It should be noted that, with the bimodal distribution, the \(rRMSE_{\pi _{k}}\) is practically identical for both methods. In terms of respect of the balancing constraints, the sabs algorithm was again confirmed as the best for every distribution and every sampling fraction considered.

4.2 Time of computation

The computing time needed for the selection of samples for a given sampling method could be an important factor in both surveys and simulations, especially if it is highly dependent on N and n. Table 5 reports the average CPU time in seconds taken by each of the algorithms to select a sample for different populations and sample sizes. Simulations were performed on a 2.3 GHz Dual-Core Intel Core i5. Samples were selected using R software. In particular, the sabs algorithm was implemented in C++ via Rcpp; the cube method was implemented by means of the BalancedSampling package (Grafström and Lisic 2016), which is a fast implementation in C++ via Rcpp, and by means of the sampling package (Tillé and Matei 2009).

Table 5 Elapsed time in seconds for the selection of one sample (average over 10 Monte Carlo replicates) by means of the sabs algorithm and of two versions of the cube method. Populations used were \(U_{1},U_{5}\), and \(U_{9}\). Balancing variables: \(J=10\). Sampling fractions \(\mathrm {f}=0.01;0.05;0.1\)

The sabs algorithm was the least computationally intensive method, as it was considerably quicker than both implementations of the cube method. Its computing time increased gradually with n and only proportionally with N, mainly because the number of attempts per iteration was set equal to N. Thus, we can be confident that the sabs algorithm can be effectively applied to large population sizes without extensive difficulties, and without any storage or RAM limitations other than those imposed by the size of the frame from which the sample is selected.

5 An experiment on real data

Real data were used in order to investigate the efficiency of the sampling design when estimating a target variable in a real context. The dataset we used is freely available from www.statbel.fgov.be and concerns fiscal statistics on income subject to personal income tax. Data are related to the 581 Belgian municipalities for the period 2005-2018. Focusing on the last available year, we aim to estimate the total Belgian net income (Y) by exploiting the three following auxiliary variables: the number of declarations with a total net taxable income greater than zero (\(z_{1}\)), the number of declarations where the amount of total taxes is greater than zero (\(z_{2}\)), and the number of residents per municipality (\(z_{3}\)). We selected \(M=10000\) fixed-size samples with equal inclusion probabilities, for sampling fractions \(\mathrm {f}=0.01;0.05;0.1\), by means of the sabs algorithm, the cube method, and simple random sampling without replacement (srswor). The latter is a traditional benchmark in survey sampling. Besides the computation of \(rRMSE_{\pi _{k}}\) and CD, the population total of Y was estimated by the expansion estimator \(\hat{Y}\). We computed the relative Root Mean Square Error of the estimator, defined as \(rRMSE_{\hat{Y}}=\frac{\sqrt{\frac{\sum _{m=1}^{M}\left( \hat{Y}_{m}-Y\right) ^{2}}{M}}}{Y}\). Table 6 shows the results of \(rRMSE_{\pi _{k}}\), CD, and \(rRMSE_{\hat{Y}}\), obtained by selecting samples with the three sampling methods considered.
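The estimator-level diagnostic used here is straightforward to compute; a minimal Python sketch (the function name is ours) follows the definition in the text:

```python
import numpy as np

def rrmse_total(Y_hat, Y):
    """Relative RMSE of the expansion estimator over M Monte Carlo
    replicates: sqrt(mean((Y_hat_m - Y)^2)) / Y."""
    Y_hat = np.asarray(Y_hat, dtype=float)
    return float(np.sqrt(np.mean((Y_hat - Y) ** 2)) / Y)
```

Smaller values indicate that the design's estimates of the target total concentrate more tightly around the true value Y.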

Table 6 Results of the Monte Carlo experiment on the Belgian municipalities data for three sampling fractions \(\mathrm {f}=0.01;0.05;0.1\). Monte Carlo replicates \(M=10000\)

The results for \(rRMSE_{\pi _{k}}\) and CD were in line with those seen in the previous simulations. Indeed, the sabs algorithm demonstrated greater efficiency in respecting the balancing constraints, while performing worse than srswor and the cube method in respecting the inclusion probabilities. Clearly, srswor is considerably superior to the balanced sampling methods in respecting the inclusion probabilities, but this advantage is outweighed by its marked inferiority in producing efficient estimates of the target variable. In fact, the ability of the sabs and cube methods to efficiently estimate the Y variable is evident, especially for the former. Hence, the efficiency of the proposed algorithm has been shown once again, together with its applicability in practical contexts.

6 Discussion and conclusions

In the present paper, a new method for selecting balanced samples by means of a simulated annealing-based algorithm has been proposed. By exploiting the possibility of interpreting balanced sampling as an optimization problem, we showed that the sabs algorithm allows for a quicker selection of well-balanced samples on a wider set of auxiliary variables, without neglecting a rigorous respect of inclusion probabilities and a strong efficiency in estimation. An in-depth investigation of both aspects is necessary in order to prove the employability of a sampling method in practical situations. In fact, a good respect of the balancing constraints results in a possible reduction of estimation variance, while the respect of the inclusion probabilities means a reduction in estimation bias. Hence, a crucial issue is finding a sampling method capable of combining these two desirable properties. In the present paper, we proved through extensive simulation experiments, on both simulated and real data, that the sabs algorithm achieves this goal, resulting in a valid and efficient alternative to the well-known cube method.