1 Introduction

The role of Debabrata Basu in the critical development of survey sampling can hardly be overstated. His paper on the foundations of the subject (Basu 1971) is a landmark which, for the first time, at least with such clarity, unveiled the irreconcilability between design-based inference and the likelihood principle. His criticism of the use of the Horvitz-Thompson estimator in order to guarantee unbiasedness of the estimates, colorfully expressed with the elephant example (see e.g. Welsh 2010), has sparked a lively and interesting debate among statisticians; as a consequence, survey sampling has experienced, in the last decades, several attempts at a radical restructuring of its foundations.

The aim of this paper is to explore and discuss the potential role of Bayesian ideas and techniques in modern survey sampling. The paper is structured as follows: § 2 discusses the theoretical conflict between design-based methods and the likelihood principle and highlights the role that a Bayesian approach could play. § 3 goes beyond the basic framework of § 2 and discusses the inevitability of a shift towards model-based techniques in modern survey statistics. § 4 reviews the most prominent and promising ideas for a Bayesian theory of inference for finite populations, namely

  • the Polya posterior approach, proposed in a series of papers by Glen Meeden and collaborators (see for example Ghosh and Meeden 1997; Strief and Meeden 2013);

  • the Calibrated Bayesian approach, popularized in several papers by Roderick Little (see for example Little 2006, 2011, 2022).

Then, § 5 presents a real case study, which we consider paradigmatic of the issues and the open problems discussed above. Finally, § 6 provides some concluding remarks.

2 The conflict

2.1 Basu’s criticism of design-based methods

To fix notation and ideas, we consider the simplest situation where a random sample without replacement is drawn from a population P with N identified units. Here N is assumed to be known and units are identified through their labels, say \(\{1, 2,\dots , N\}\). We draw a sample s of size n and assume that the randomization scheme assigns a probability p(s) to this specific sample. The quantity of interest is the vector of values of a variable Y on the entire population, say \(Y_P=(y_1, y_2, \dots , y_N)\), or a specific function of it, say \(\tau =f(Y_P)\). However, \(Y_P\) is observed only on the units belonging to the sample s. Let \(Y_s\) be the set of observed values: then the goal is to make inferential statements on the unobserved values \(Y_{P\backslash s}\).

A design-based technique for producing an unbiased estimator of \(\tau \) is based on the Horvitz-Thompson strategy, which suggests using an empirical version of \(\tau \) where the n observations are weighted by the inverse of their probabilities of being included in the sample. Unbiasedness is calculated with respect to the randomization scheme (the sampling design); this is usually inspired by, but not necessarily related to, \(Y_P\), which is considered fixed but unknown. Basu formally proved that, in this setting, the likelihood function for \(Y_P\) is flat, i.e. it is equal to a positive constant for all values compatible with the observed \(Y_s\), and it is zero otherwise.
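As a minimal sketch of what design-unbiasedness means here, the snippet below computes the Horvitz-Thompson estimate of a population total and averages it over every possible sample. The population values, the sample size and the function name `horvitz_thompson_total` are illustrative assumptions, not taken from the cited papers; simple random sampling without replacement is assumed, so that every unit has inclusion probability n/N.

```python
from itertools import combinations
from typing import Sequence


def horvitz_thompson_total(y_s: Sequence[float], pi_s: Sequence[float]) -> float:
    """Estimate the population total by weighting each sampled value
    with the inverse of its first-order inclusion probability."""
    if len(y_s) != len(pi_s):
        raise ValueError("y_s and pi_s must have the same length")
    if any(not 0.0 < p <= 1.0 for p in pi_s):
        raise ValueError("inclusion probabilities must lie in (0, 1]")
    return sum(y / p for y, p in zip(y_s, pi_s))


# Hypothetical population of N = 4 units; its total is 20.
Y_P = [3.0, 5.0, 8.0, 4.0]
n = 2
pi = n / len(Y_P)  # inclusion probability of every unit under SRSWOR

# All samples of size n are equally likely, so the design expectation
# is the plain average of the estimates over all possible samples.
samples = list(combinations(range(len(Y_P)), n))
estimates = [horvitz_thompson_total([Y_P[i] for i in s], [pi] * n) for s in samples]
design_expectation = sum(estimates) / len(samples)

print(estimates)           # the estimate changes from sample to sample
print(design_expectation)  # the average over the design equals the true total
```

The expectation is taken over the samples only: the values in `Y_P` are held fixed throughout, which is exactly the sense in which the design, rather than a model for Y, drives the inference.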

Many scientists have interpreted this result as a proof of a general inadequacy of the likelihood function - and the likelihood principle - as the main tool of the inferential process in this framework. On the other hand, Basu and other Bayesian statisticians believe that this specific context offers an example where the likelihood function provides obvious but correct results, and this clarifies the insufficient level of modeling of the design-based methods.

From a historical perspective, the first attempt to overcome the difficulties of using a likelihood function in a finite population sampling framework can be considered the scale-load approach described in Hartley and Rao (1968), where the support of the quantity of interest Y is discretized into T different values, having frequencies \(N_1, N_2, \dots , N_T\) at the population level and \(n_1, n_2, \dots , n_T\) at the sample level. Here the choice of T is not crucial and the Authors consider, as a likelihood function for the unknown vector \((N_1, \dots , N_T)\), the hypergeometric distribution associated with the observed sample:

$$\begin{aligned} L(N_1, \dots , N_T) = \prod _{t=1}^T {N_t \atopwithdelims ()n_t} \big / {N \atopwithdelims ()n}. \end{aligned}$$

The Authors described how to make consistent inferences on functions of \((N_1, \dots , N_T)\), through maximization of the likelihood or by combining it with a suitable prior distribution. The scale-load approach can be considered a precursor of the more general notion of empirical likelihood, developed after Owen (1988) and reconsidered, in the context of finite population sampling, in several papers: see for example Zhong and Rao (2000) and Berger (2018).
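To make the scale-load likelihood concrete, the following sketch evaluates the hypergeometric expression above on a toy problem and locates its maximizer by brute force. The numbers (N = 10, T = 2, sample frequencies 2 and 1) and the function name `scale_load_likelihood` are assumptions chosen for illustration; this is not the estimation procedure of Hartley and Rao (1968), only a direct evaluation of the formula.

```python
from math import comb
from typing import Sequence


def scale_load_likelihood(N_counts: Sequence[int], n_counts: Sequence[int]) -> float:
    """Hypergeometric likelihood of the population frequencies (N_1..N_T)
    given the observed sample frequencies (n_1..n_T)."""
    if len(N_counts) != len(n_counts):
        raise ValueError("both frequency vectors must have length T")
    N, n = sum(N_counts), sum(n_counts)
    numerator = 1
    for N_t, n_t in zip(N_counts, n_counts):
        numerator *= comb(N_t, n_t)  # comb gives 0 when n_t > N_t
    return numerator / comb(N, n)


# Hypothetical example: N = 10 units, T = 2 support points,
# and a sample of size 3 with frequencies (2, 1).
N, n_counts = 10, (2, 1)
likelihoods = {N1: scale_load_likelihood((N1, N - N1), n_counts) for N1 in range(N + 1)}
best_N1 = max(likelihoods, key=likelihoods.get)

print(best_N1)               # maximum likelihood value of N_1
print(likelihoods[best_N1])  # likelihood at the maximum
```

Unlike the flat likelihood for \(Y_P\), this function varies with the unknown frequencies, and population configurations incompatible with the sample receive likelihood zero.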

2.2 Design-based, likelihood and Bayes

In the last decades, the scientific debate around survey sampling has been quite lively and it has dealt mainly with the contrast between design-based and model-based approaches. This is not the place to recall the details of the conflict, and we will only highlight the main points. A deep comparative analysis of different inferential approaches in finite population sampling can be found in Beaumont and Haziza (2022).

In survey sampling, many different issues must be considered in the optimization of the sampling plan. In spite of that, in a design-based philosophy, the construction of point estimators and confidence intervals is based on the randomization process and it is not always directly related to the quantity of interest \(Y_P\). In other terms, the design-based analysis leads to conclusions about the finite population quantity totally free of assumptions about the structure of the variation in the population (Cox 2006).

A pure likelihood analysis of such a problem is bound to provide trivial conclusions (Basu 1971; Godambe 1966): all the configurations of the parameter \(Y_{P}\) compatible with \(Y_s\) receive the same support from the likelihood; the likelihood function itself is not able to introduce any sort of similarities/dissimilarities among the units in the sample and those not observed. This is done surreptitiously in a design-based approach by silently assuming a sort of exchangeability among the units. The Bayesian road seems at least clearer. In order to provide inference on the vector \(Y_P\), which is now treated as a random vector, one needs to introduce a prior distribution in the game, and the prior is precisely the instrument that formalizes potential external information about the mutual similarities among units.

In more detail, and following Little (2022), let us denote by \(S_P=(S_1, \dots , S_N)\) the vector of selection indexes for the N units of the population, that is

$$\begin{aligned} S_i = {\left\{ \begin{array}{ll} 1 &{} i\text {-th individual is in the sample}\\ 0 &{} \text {otherwise}. \end{array}\right. } \end{aligned}$$

If \(Z_P\) denotes a vector including all other design-related variables, the goal is to make inference on some quantity \(Q(Y_P)\), using possible covariate information \(Z_P\). A model-based inference approach will be based on the joint distribution

$$\begin{aligned} p_{s, y ; z}(s, y ; z, \theta , \psi ) = p_{y ; z}(y; z, \theta ) p_{s| y ; z}(s \vert y; z, \psi ), \end{aligned}$$

where \(\theta \) is a vector of parameters directly related to the variable of interest y and \(\psi \) only refers to the mechanism of inclusion.

Bayesian inference in this context requires the introduction of a prior distribution \(p_{y; z}(y; z)\) for the population values. Inferences are then based on the posterior predictive distribution of the non-sampled values \(Y_{P\backslash s}\) of \(Y_P\), given the sampled values. The prior distribution is often specified in a hierarchical way, where a parametric model \(p_{y; z}(y; z, \theta )\) indexed by parameters \(\theta \), in practice the one appearing in Eq. 2, is combined with a prior distribution \(p(\theta ; z)\). Then, if we assume for the sake of simplicity that the sampling mechanism is ignorable,

$$\begin{aligned} p(y; z) = \int p(y \vert \theta , z) p(\theta ; z) d \theta . \end{aligned}$$

The posterior predictive distribution of the non-sampled values \(Y_{ P\backslash s}\) is then

$$\begin{aligned} p(Y_{ P\backslash s} \vert { Y}_s ; z) = \int p(Y_{ P\backslash s} \vert {Y}_s ; z, \theta ) p(\theta \vert {Y}_s; z) d \theta \end{aligned}$$

where \(p(\theta \vert {Y}_s; z)\) is the posterior distribution of the hyper-parameters \(\theta \).
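The two-step structure of the posterior predictive distribution (draw \(\theta \) from its posterior, then draw the non-sampled values given \(\theta \)) can be sketched by simulation in the simplest possible case. The example below assumes, purely for illustration, a binary variable, exchangeable Bernoulli(\(\theta \)) population values, a Beta(a, b) prior, no covariates and an ignorable sampling mechanism; none of these choices comes from the text, and the function name is hypothetical.

```python
import random
from statistics import mean
from typing import List, Sequence


def posterior_predictive_totals(y_s: Sequence[int], N: int, a: float = 1.0,
                                b: float = 1.0, M: int = 4000,
                                seed: int = 0) -> List[int]:
    """Monte Carlo draws of the population total of a binary variable."""
    rng = random.Random(seed)
    n, k = len(y_s), sum(y_s)
    totals = []
    for _ in range(M):
        # Step 1: draw theta from its posterior given the sample, Beta(a + k, b + n - k).
        theta = rng.betavariate(a + k, b + n - k)
        # Step 2: draw the N - n non-sampled values given theta and count the ones.
        unseen = sum(rng.random() < theta for _ in range(N - n))
        # The sampled part is known exactly; only the unseen part is random.
        totals.append(k + unseen)
    return totals


# Hypothetical sample: 8 ones among n = 20 units, population of N = 100.
y_s = [1] * 8 + [0] * 12
totals = posterior_predictive_totals(y_s, N=100)
print(mean(totals))                 # posterior predictive mean of the total
print(min(totals), max(totals))     # range of the simulated totals
```

In this conjugate setting the simulated mean can be checked against the closed form \(k + (N-n)(a+k)/(a+b+n)\); in realistic models the integral over \(\theta \) is rarely available analytically and simulation of this kind is the default.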

The issues of deep irreconcilability among different inferential paradigms, in general, do not cause huge differences in practice. In simple situations, if the units are approximately exchangeable, the use of the Horvitz-Thompson estimator would provide the same numerical answer that one could obtain with a weakly informative prior, or even using a Bayesian method without a prior, as in Ghosh and Meeden (1997) and Strief and Meeden (2013).

The debate between design-based and model-based approaches is basically internal to the non-Bayesian world and it appeared - and has increasingly become - relevant because of the more and more complex problems faced by modern survey sampling.

Much of the theory of model-based methodology has been developed in a non-Bayesian fashion, starting from the seminal and thought-provoking papers of Royall (1970, 1976), where the role of the likelihood function, when properly defined and interpreted, is deemed central also in a finite population framework, particularly for predictive purposes, provided that a superpopulation perspective is adopted. The prediction approach is extensively discussed and supported in Valliant et al. (2000), and some Bayesian versions of superpopulation modeling can also be found in Zacks (2002) and Bolfarine and Zacks (1992).

The most prominent cases where a design-based approach has provided unsatisfactory results can be listed as follows:

  • Small Area Estimation, where the sample size in a sub-population/domain of interest may be so small (hence "small area") as to jeopardize the reliability of the estimates;

  • the presence of non-sampling errors in the selection of the sample, that can hardly be introduced in the randomization process;

  • the presence of non-random patterns of non-response and/or missingness among units; in these cases, some units might have a negligible or zero probability of being included in the sample and this occurrence practically destroys any possible assumption of exchangeability among the units, so requiring extra modeling;

  • inference based on the integration among survey data and other data sources, for instance, register-based data (Lohr and Raghunathan 2017).

These issues are discussed in § 3.

3 Modern survey sampling

Basu’s classical elephant example highlighted a few settings in which Horvitz-Thompson estimation could be inefficient, particularly when the sampling plan cannot be designed to proxy the distribution of the values of the variable(s) of interest. The use of auxiliary information has been the first device to improve the efficiency of design-based estimates. Indeed, the statistician in the elephant example could well have saved his job at the circus by exploiting the knowledge of the population size and simply using the Hajek estimator! The model-assisted framework has kept researchers busy for more than 50 years and has allowed them to employ auxiliary information at the estimation stage, using countless assisting models to describe the relationship with the response variable(s). See Breidt and Opsomer (2017) for a recent review including modern regression techniques.
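A small numerical sketch may help to see why the Hajek estimator rescues the circus statistician. The selection probabilities below follow the usual telling of Basu's example (50 elephants, probability 99/100 for the average-sized elephant Sambo and 1/4900 for each of the others); the weights in kilograms are invented, and both function names are hypothetical.

```python
from typing import Sequence


def horvitz_thompson_total(y_s: Sequence[float], pi_s: Sequence[float]) -> float:
    """Sum of the sampled values, each divided by its inclusion probability."""
    return sum(y / p for y, p in zip(y_s, pi_s))


def hajek_total(y_s: Sequence[float], pi_s: Sequence[float], N: int) -> float:
    """Known population size N times the inverse-probability weighted mean."""
    weights = [1.0 / p for p in pi_s]
    weighted_mean = sum(w * y for w, y in zip(weights, y_s)) / sum(weights)
    return N * weighted_mean


N = 50                 # number of elephants in the circus
pi_sambo = 99 / 100    # Sambo, the average-sized elephant
pi_other = 1 / 4900    # each of the remaining 49 elephants

# Invented weights (kg) for the two possible outcomes of a sample of size one.
sambo_kg, jumbo_kg = 4000.0, 6000.0

print(horvitz_thompson_total([sambo_kg], [pi_sambo]))  # roughly one elephant's weight
print(horvitz_thompson_total([jumbo_kg], [pi_other]))  # absurdly large
print(hajek_total([sambo_kg], [pi_sambo], N))          # 50 times Sambo's weight
print(hajek_total([jumbo_kg], [pi_other], N))          # 50 times Jumbo's weight
```

With a single sampled unit the inclusion probability cancels in the Hajek ratio, so the estimate is always N times the observed weight, whichever elephant is drawn.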

The model-assisted approach has brought models out of the shadows in design-based inference. Nonetheless, there are instances in which the design-based framework, even if model-assisted, is not enough to allow for consistent and/or efficient estimation strategies. The first and most notable is small area estimation (SAE), in which the sample size available in a sub-population of interest is so small that Horvitz-Thompson direct estimates, albeit unbiased, have unduly large variances.

SAE methods are indirect, as they make use of observations coming from other domains/areas to obtain estimates for a particular area, and are essentially model-based. SAE methods have seen tremendous development in the past 25 years: a comprehensive review, current as of its year of publication, can be found in Rao and Molina (2015). In a presentation of the first edition of this book, published in 2003, Jon Rao admitted his genuine surprise when he finished writing it and realized that the longest chapters were those dedicated to methods using a hierarchical Bayesian approach. Then, if SAE can undoubtedly be considered the Trojan horse that brought model-based estimates into National Statistical Offices to address the need for timeliness and granularity of estimates, we may hint at a similar role for the Bayesian approach. See, as a notable example, the SAIPE program of the US Census Bureau (United States Census Bureau 2021).

Another challenge that has reduced the suitability of design-based methods in survey sampling in the past years is the increased impact of non-sampling errors. When the outcome of under- and/or over-coverage, of (item or unit) non-response, or of measurement error (including mode effects) depends on unknown processes, it is impossible to draw a purely design-based inference, and the recourse to modeling is unavoidable. In this context, modeling is required to reduce a possibly non-negligible bias, rather than variance. The assumptions under which such bias reduction is achieved can seldom be verified and the inference is therefore model-based. This is true also when calibration or other reweighting methods such as raking are used to address under-coverage and/or non-response because, in this framework, modeling choices are implicit in the (possibly generalized) calibration/reweighting procedure (Haziza and Lesage 2016; Lesage et al. 2019).

In a Bayesian framework, non-response and under-coverage are often adjusted using inverse propensity weighting; see e.g. Little (1986). Indeed, propensity score adjustment was originally developed by Rosenbaum and Rubin (1984) to address selection bias in experimental designs but has been used extensively to control for selection bias in non-probability samples. Elliott and Valliant (2017) and the recent discussion paper by Wu (2022) provide a review of methods to draw inferences from such data. These can be seen as situations in which non-response and/or under-coverage are taken to the extreme. Voluntary (typically online) surveys provide a large amount of (usually cheap) information that can lead to misleading conclusions if bias is not properly mitigated. The large amount of information available with non-probability samples, and with big data more generally, can lead to less trustworthy conclusions because of the apparently large sample size (Meng 2018). In this context, Lee (2006) provides evidence that a reference probability sample must be available to obtain reliable inference.

Data integration is a very active field of research that develops techniques for combining a probability sample with a non-probability data source (see, for a review, Yang and Kim 2020). These techniques are essentially model-based and can be grouped into two main approaches. The first approach is weighting, which can be based on propensity score adjustments: propensity scores are pseudo-inclusion probabilities estimated from covariates available for sampled and non-sampled units. Calibration is another weighting approach, which estimates the weights directly by calibrating auxiliary information in the non-probability sample to that in the probability sample. In the second approach, superpopulation modeling for the variable(s) of interest collected on sample units is used to predict values for non-sampled units. This approach is closely related to mass imputation, and multiple imputation (Rubin 2004) can be used in this framework. Doubly robust estimation methods combine the weighting and imputation approaches to improve robustness against model misspecification (Kim and Haziza 2014).
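The weighting idea can be illustrated with the simplest propensity model available, namely weighting classes, in which the response propensity is estimated by the weighted response rate within each class. This is a deliberately simplified stand-in for the propensity models used in the papers cited above; the class labels, the design weights and the function name are all hypothetical.

```python
from collections import defaultdict
from typing import Hashable, List, Sequence


def propensity_adjusted_weights(classes: Sequence[Hashable],
                                responded: Sequence[bool],
                                design_weights: Sequence[float]) -> List[float]:
    """Divide each respondent's design weight by the estimated response
    propensity of its weighting class; non-respondents get weight zero."""
    sampled = defaultdict(float)      # total design weight per class
    respondents = defaultdict(float)  # design weight of respondents per class
    for c, r, w in zip(classes, responded, design_weights):
        sampled[c] += w
        if r:
            respondents[c] += w
    adjusted = []
    for c, r, w in zip(classes, responded, design_weights):
        if r:
            propensity = respondents[c] / sampled[c]  # estimated response rate
            adjusted.append(w / propensity)
        else:
            adjusted.append(0.0)
    return adjusted


# Hypothetical sample of five units in two weighting classes.
classes = ["young", "young", "young", "old", "old"]
responded = [True, False, True, True, False]
design_weights = [10.0] * 5

adjusted = propensity_adjusted_weights(classes, responded, design_weights)
print(adjusted)       # respondents carry the weight of the non-respondents
print(sum(adjusted))  # matches the total design weight when every class has a respondent
```

The adjustment removes bias only if response is unrelated to the variable of interest within each class, which is precisely the kind of unverifiable modeling assumption discussed above.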

Data integration can occur at the micro-level when the information coming from a small survey is enriched by extra information coming from administrative non-probabilistic lists. The linkage step is generally not flawless, due to measurement errors and to the changing status of the statistical units involved in the process. This kind of problem calls for record linkage techniques. Record linkage is a class of statistical and algorithmic methods that aim at identifying whether two or more observed records refer to the same statistical entity or not. Duplications of the same entity within one single source or across different files may be interpreted as “clusters of records”, showing strong similarities across their fields. The record linkage process may then also be viewed as a formal Bayesian or non-Bayesian micro-clustering model; see Johndrow et al. (2018) and Tancredi et al. (2020).

All the above-mentioned issues become particularly relevant in the production of official statistics, where the problem of harmonizing and merging information coming from different sources becomes central as the general framework moves towards an integrated system of statistical production and dissemination (D’Orazio et al. 2006). In fact, the National Statistical Institutes of well-developed countries are progressively shifting from a survey-based system, where the sampling design played a decisive role, to an integrated system where administrative lists may help to build a more complex data structure that represents different populations of interest. This data structure may then be combined with specific surveys and/or with other types of non-probability/big data to produce statistical information at a possibly very granular level of domains.

4 New ideas from a Bayesian perspective

The logical conflict between design-based methods and Bayesian philosophy has generated a sort of practical separation between Official Statistics and Bayesian methodology, with the unpleasant result that survey sampling is not a typical research theme among Bayesian-oriented Ph.D. students, despite its relevance from an applied perspective. Currently, much of the academic research is devoted to creating or developing Bayesian versions of the model-based procedures that have already become a relevant part of the applied survey statistician’s toolkit. Nevertheless, there have been more systematic attempts to reformulate the entire survey sampling methodology from a Bayesian perspective.

From a historical perspective, the first instance of the practical relevance of Bayesian methodology in survey sampling other than the above-mentioned small area estimation problem can be dated back to the introduction of multiple imputation techniques (Rubin 2004) for dealing with non-response and, more generally, missing data issues.

A nonparametric Bayesian approach was initially proposed by Ericson (1969) in the context of simple random sampling. To overcome the problem of the already mentioned flatness of the likelihood function (Godambe 1966), an exchangeable prior on the N-dimensional parameter \(Y_P\) is assumed. Using a weakly informative prior, one re-obtains design-based results from a completely different perspective. Similar results were obtained by Lo (1986), where a Dirichlet-Multinomial process is introduced, which converges, as \(N \rightarrow \infty \), to a standard Dirichlet process. In the case of stratified sampling, priors are assumed exchangeable only within strata, in the spirit of hierarchical modeling (Rao 2011). Lo (1988) introduced the finite population Bayesian bootstrap (FPBB), which is defined in terms of a Polya’s urn scheme and is implemented by simulating a posterior distribution starting from a flat Dirichlet-Multinomial prior, as described in Lo (1986).

4.1 The Polya Posterior

Building on the seminal work of Lo (1988), an alternative approach to inference for finite populations is described in a series of papers by Glen Meeden and his collaborators: see for example Ghosh and Meeden (1997), Strief and Meeden (2013) and Lazar et al. (2008). It is known as the Polya posterior approach and it can be considered the finite population adaptation of the Bayesian bootstrap proposed by Rubin (1981). Consider the following simple scenario. Suppose we have a population of N units and we draw a simple random sample of size n, say \(Y_s\): assume that the goal is to estimate the mean \(\theta \) of some function \(h(Y_P)\). We put the n observed units in another urn \(U_2\) and leave the other \(N-n\) units in the original urn \(U_1\). Then we proceed as follows:

  1. we draw a unit from \(U_2\) and observe its value y;

  2. we draw a unit from \(U_1\), attach to it the y value and replace both units in \(U_2\);

  3. repeat steps 1-2 until \(U_1\) is empty.

This way we have simulated a realization of the entire population. We repeat this simulation a large number M of times, in order to get a posterior distribution for the quantity of interest \(Y_P\), which can be summarized using descriptive statistics. Of course, this should only be interpreted as a pseudo-Bayesian posterior, since no prior has been introduced. Nonetheless, Lo (1988) provides important theoretical results for this procedure. Assume that in \(Y_s\) there are k distinct values and, for \(j=1, \dots , k\), let \(n_j\) be the corresponding frequencies in the sample. In performing an FPBB, let \( m^*_j\) be the random frequencies of the k distinct values among the \(N-n\) simulated units.
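The urn scheme described above translates almost line by line into code. The following is a minimal sketch, assuming a simple random sample and taking the population mean as the summary of interest; the sample values, the population size and the function names are hypothetical.

```python
import random
from statistics import mean
from typing import Callable, List, Sequence


def polya_population(y_s: Sequence[float], N: int, rng: random.Random) -> List[float]:
    """Simulate one complete population of size N from the sample y_s."""
    urn = list(y_s)  # urn U_2 starts with the n observed values
    for _ in range(N - len(y_s)):  # one pass per unit still in U_1
        # Draw a value from U_2, attach it to a unit taken from U_1 and put
        # both back in U_2: the urn grows by one copy of the drawn value.
        urn.append(rng.choice(urn))
    return urn


def polya_posterior(y_s: Sequence[float], N: int, M: int,
                    stat: Callable[[Sequence[float]], float] = mean,
                    seed: int = 0) -> List[float]:
    """M simulated values of a population summary under the Polya posterior."""
    rng = random.Random(seed)
    return [stat(polya_population(y_s, N, rng)) for _ in range(M)]


# Hypothetical sample of n = 5 values from a population of N = 50 units.
y_s = [2.0, 4.0, 4.0, 7.0, 9.0]
draws = polya_posterior(y_s, N=50, M=2000)

print(mean(draws))               # centred near the sample mean
print(min(draws), max(draws))    # spread of the simulated population means
```

Passing a different function as `stat` (a median, a quantile, a ratio) gives the pseudo-posterior of that summary from the same simulated populations, which is what makes the approach convenient when several parameters are of interest at once.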

Theorem 1

(Lo 1988) The following statements hold:

  • The random vector \((m_1^*, \dots , m_k^*) \vert Y_s\) has a Dirichlet-Multinomial distribution (Mosimann 1962) with parameters \((N-n; n_1, n_2, \dots , n_k)\).

  • As \({N\rightarrow \infty }\),

    $$\begin{aligned} \left( \frac{m_1^*}{N-n}, \dots , \frac{m_k^*}{N-n} \right) \vert Y_s {\mathop {\rightarrow }\limits ^{d}} \text {Dirichlet}(n_1, n_2, \dots , n_k), \end{aligned}$$

    where \({\mathop {\rightarrow }\limits ^{d}}\) denotes convergence in distribution.

The Dirichlet-Multinomial distribution cited in Theorem 1 can be interpreted as the multivariate extension of the Beta-Binomial distribution, that is, a mixture of Binomial(n, p) distributions with fixed n and with p following a Beta distribution. Mosimann (1962) provides a general account of its properties. The second statement of Theorem 1 implies that, when the sampling fraction \(f=n/N\) is negligible, one can avoid actually performing simulations of the urn scheme and approximate the posterior distribution with the Dirichlet distribution. This idea is crucial in the development of the Polya posterior methodology, especially when extra information on the population is available and can be translated into linear constraints on the Dirichlet random vector.
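When the sampling fraction is negligible, the Dirichlet approximation replaces the urn simulation with a single draw of proportions over the distinct sample values. The sketch below generates Dirichlet vectors from normalized Gamma variates, a standard construction, and uses them to approximate the pseudo-posterior of the population mean; the data and the function names are hypothetical, and no linear constraints are imposed.

```python
import random
from statistics import mean
from typing import List, Sequence


def dirichlet_draw(alphas: Sequence[float], rng: random.Random) -> List[float]:
    """One Dirichlet(alphas) vector, obtained by normalizing Gamma variates."""
    gammas = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(gammas)
    return [g / total for g in gammas]


def approximate_posterior_means(values: Sequence[float], counts: Sequence[int],
                                M: int, seed: int = 0) -> List[float]:
    """Approximate draws of the population mean when f = n/N is negligible:
    the proportions of the k distinct values follow Dirichlet(n_1, ..., n_k)."""
    rng = random.Random(seed)
    draws = []
    for _ in range(M):
        proportions = dirichlet_draw(counts, rng)
        draws.append(sum(v * p for v, p in zip(values, proportions)))
    return draws


# Hypothetical sample with k = 4 distinct values and frequencies (1, 2, 1, 1).
values = [2.0, 4.0, 7.0, 9.0]
counts = [1, 2, 1, 1]
draws = approximate_posterior_means(values, counts, M=2000)

print(mean(draws))  # centred near the sample mean
```

This is the same computation as Rubin's Bayesian bootstrap applied to the distinct values, and its cost does not grow with N, in contrast with the explicit urn simulation.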

Although the Polya posterior does not stem from any specific prior distribution, Lo (1988) also proved that it can be derived as the posterior distribution on \(\theta \) when the prior on the values of \(Y_s\) is a “flat” Dirichlet-Multinomial.

The Polya posterior approach thus simulates the entire population and allows simple inferences on specific parameters of the population; it is particularly useful when many parameters need to be estimated at the same time. There have been several attempts to extend this approach to more general contexts. Strief and Meeden (2013) proposed an alternative step-wise Bayesian justification of the use of the sampling weights, which is not directly related to the sampling design and makes use of the standard kind of information present in auxiliary variables; however, it does not assume a model relating the auxiliary variables to the characteristic of interest. Dong et al. (2014) attempted to extend the finite population Bayesian bootstrap of Lo (1988) to account for complex sample designs. Their paper shares the goal of the inverse sampling technique and can be regarded as the Bayesian finite population version of inverse sampling. Lazar et al. (2008) consider the problem of implementing a Polya posterior approach in the presence of genuine partial information about auxiliary variables.

A limitation of the Polya posterior approach is that it requires an exchangeability assumption, which is not always tenable. In addition, Rao (2011) noted that “Also, it is not clear how this method can handle complex designs, such as stratified multistage sampling designs, or even single-stage unequal probability sampling without replacement with non-negligible sampling fractions, and provide design-calibrated Bayesian inferences.”

The use of the Bayesian bootstrap in a finite population setting is also discussed in Aitkin (2008), Carota (2009), and Cocchi et al. (2022), where a procedure for estimating the variance in a multiple frame context is proposed.

4.2 Calibrated Bayes

In a series of papers over the last 15 years, Roderick Little has strongly advocated the use of Bayesian methods in survey sampling and, more generally, in official statistics. In short, Little argues for a compromise between various approaches. While inference procedures should follow a Bayesian road, design features like clustering and stratification should be explicitly incorporated into the model to avoid the sensitivity of inference to model misspecification. In other terms, a purely design-based approach to finite population inference is no longer able to “adequately address many of the problems of modern sample survey” (Little 2022) and a model-based approach is deemed necessary: however, the model-based approach should be dressed in a Bayesian suit in order to easily incorporate survey sample design features. This compromise would guarantee good frequentist properties and would also benefit from the richness of information that the posterior predictive distribution provides.

Consider again the model-based framework expressed by Eq. 2. If we ignore, for the sake of simplicity, the issue of non-response, the distribution of \(S\vert Z,Y\) does not actually depend on Y and the contribution of the likelihood function to inference is restricted to the term \(p_{y; z}(y; z, \theta )\), which is combined with a suitable prior on \(\theta \) in order to produce the posterior predictive distribution Eq. 4 for the non-observable quantity \(Y_{P\backslash s}\).

This obvious consideration simply rules out any chance that the Bayesian answers could be efficient from a frequentist perspective if the word “frequentist” is meant in terms of the sampling mechanism. It is then clear that the frequentist properties should be considered either with respect to the conditional model induced by the family of distributions \(p_{y; z}(y; z, \theta )\), or to the joint distribution \(p_{y, s; z}(y, s; z, \psi , \theta )\).

It is well known (see Berger et al. 2009; Consonni et al. 2018) that a correct frequentist coverage of Bayesian procedures can be obtained only through the use of formal “noninformative” priors, whose exact expression depends on the specific statistical model. The derivation of a sensible noninformative prior is therefore not always easy. For example, the usual improper priors that are routinely used in standard statistical models are not adequate for small area estimation and, more generally, for hierarchical models. See, as a general reference, Berger et al. (2020), where the Authors derive a proper prior on the boundary of admissibility, which is as diffuse as possible without resulting in inadmissible procedures. A more specific analysis for small area models is described in Burris and Hoff (2019), where an alternative confidence interval procedure for the area means and totals is proposed under normally distributed sampling errors.

In general, the calibration of Bayesian procedures under complex sampling design is problematic and some approximations are often unavoidable. Things are even more complicated in the presence of non-ignorable non-response patterns, which must be taken into account in the sampling model. The next section is devoted to the description of such a real case study.

An alternative route that tries to combine design and Bayesian properties is proposed in Wang et al. (2017). Here the likelihood is replaced by the sampling distribution of some summary statistics with “design-based” properties: this “pseudo-likelihood” is then combined with a prior reflecting genuine or vague prior information. This approach, although approximate in principle, provides “calibrated Bayes” procedures when combined with noninformative priors.

5 A real case study

In this section, we discuss a real case study where we consider the potential benefits and the inherent difficulties of a fully Bayesian treatment of the problem. During the last years, the Italian National Statistical Institute (Istat, hereafter) has begun a long and complex process of reorganization of data production and dissemination, called modernization, which can basically be described in three steps.

  1. A main global infrastructure consisting of an integrated system of statistical registers.

  2. The introduction of repeated sample surveys with the goal of constructing, updating, and enriching the statistical registers by observing new variables.

  3. An integrated use of non-probability data coming from different kinds of sources (e.g., big data) for producing new information such as Trusted Smart Statistics.

An important case study, illustrative of the new data production system, is the Italian Permanent Census (IPC), which replaces the general Population census, previously carried out every 10 years (the last one dates back to 2011): the new IPC system is a prototypical example of the new data production process and we now briefly describe it.

The starting point is the construction of the BRI (Base Register of Individuals). BRI is a “list” of \(N_R\) people who are residents in Italy, collected from all Italian municipalities; the BRI contains some core information such as gender, citizenship, and age, based on administrative data, which are considered highly reliable. BRI is then enriched with the reconstruction, through the implementation of suitable statistical models, of additional variables, namely educational level and employment status. The former is reconstructed via a log-linear model based on administrative data, while the latter is predicted through a suitably tailored hidden Markov model (Boeschoten et al. 2021).

Istat conducts two surveys to obtain an area sample \(s^A\) and a list sample \(s^L\) in order to evaluate the probabilities of under-coverage and of over-coverage, respectively, of BRI at the municipality level; these estimates are then used to correct administrative counts and to obtain estimates of the resident population. Population counts corrected for coverage errors are obtained through weighted counts of the BRI, where the weights are calculated as the ratio between the above probabilities.

The area and the list surveys are carried out using a sampling design that is quite common in National Statistical Institutes. In fact, they both follow a two-stage complex design where municipalities are Primary Sampling Units (PSUs) and households (for the list sample) or administrative geographical areas (for the area sample) are Secondary Sampling Units (SSUs). In particular, for the area sample \(s^A\), the SSUs are addresses and enumeration areas. For each year of the census cycle, both the area and the list surveys share the same sample PSUs. Nevertheless, the samples of households in the two surveys are negatively coordinated from one to the other and for different survey occasions. It is worth noting that an allocation step of the sample size of SSUs (frequently combined with balancing procedures) actually determines their inclusion probabilities.

With the goal of discussing the potential use of Bayesian models in a real NSI case, it is important here to report some details of the sampling design adopted for the permanent census. For the sake of brevity, we discuss in detail the sampling design of \(s^L\) for estimating over-coverage probabilities. Then, we discuss the modeling of under-coverage probabilities.

As detailed in Righi et al. (2021), at the first stage all municipalities with a population size larger than 18,000 inhabitants and all municipalities selected in the Labor Force Survey (LFS, hereafter) are classified as self-representative (SR), while the others are considered non-self-representative (NSR). All SR municipalities are included in the sample, and for the NSR municipalities, a sample is drawn according to a probabilistic sampling design as follows. Within each province (LAU1), NSR municipalities are stratified in order to obtain homogeneous strata in terms of population size. Each stratum consists of four PSUs, and one single PSU is drawn from each stratum each year according to simple random sampling without replacement; this way, in a four-year period, all the Italian municipalities can be observed. Then households are selected from each municipality adopting a simple random sampling without replacement design.
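The yearly rotation of NSR municipalities described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the function name `rotation_plan` and the stratum and municipality labels are hypothetical, and each stratum is assumed to contain exactly four PSUs. Drawing one PSU per year without replacement is implemented by permuting each stratum once and reading off one position per year.

```python
import numpy as np

def rotation_plan(strata, rng):
    """Return {year: [one municipality per stratum]} for a four-year cycle.

    `strata` maps a stratum label to its four NSR municipalities.
    A single random permutation per stratum is equivalent to drawing
    one PSU per year by simple random sampling without replacement.
    """
    plan = {year: [] for year in range(1, 5)}
    for label in sorted(strata):
        units = strata[label]
        order = rng.permutation(len(units))
        for year, idx in zip(range(1, 5), order):
            plan[year].append(units[idx])
    return plan

rng = np.random.default_rng(0)
# Hypothetical strata of four municipalities each
strata = {"h1": ["A", "B", "C", "D"], "h2": ["E", "F", "G", "H"]}
plan = rotation_plan(strata, rng)
for year, sampled in plan.items():
    print(year, sampled)
```

By construction, every municipality enters the sample exactly once over the four years, which is the property the design relies on.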

The sample is first allocated among provinces, based on a trade-off between an equal sampling fraction and a sampling fraction inversely proportional to the population size of the provinces; as a result, a larger sampling fraction is planned for smaller provinces than for larger ones. Afterward, in each province, the household sample is allocated within the municipalities as follows:

  • for SR municipalities, a trade-off between an equal sampling rate and a proportional allocation is considered, in order to limit the number of households in the larger municipalities;

  • for SR municipalities coming from the LFS, a proportional allocation is planned;

  • for NSR municipalities, the sampling fraction assigned to each stratum is proportional to the population size of the stratum, so that each municipality is assigned a sample of households that is also representative, at least in terms of size, of the other three municipalities included in the stratum but not included in the sample for that specific year.

In addition, a minimum number of 100 households has to be included in each municipality. As a consequence, municipalities with a smaller number of households are completely enumerated. Finally, all members of selected households are interviewed.
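The minimum-size rule can be made concrete with a small Python sketch. The function name and the sampling fractions used in the example are hypothetical; the trade-offs that actually determine the planned fraction for each municipality are not reproduced here. Only the floor of 100 households and the complete enumeration of smaller municipalities come from the description above.

```python
def households_to_sample(n_households, sampling_fraction, minimum=100):
    """Household sample size for one municipality.

    The planned size is raised to `minimum` when it falls below it, and
    capped at the number of households, so that municipalities with
    fewer than `minimum` households are completely enumerated.
    """
    planned = round(sampling_fraction * n_households)
    return min(n_households, max(minimum, planned))

# Hypothetical municipalities: (number of households, planned fraction)
for n_hh, frac in [(80, 0.10), (1500, 0.02), (5000, 0.05)]:
    print(n_hh, frac, households_to_sample(n_hh, frac))
```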

The complex, although quite standard, structure of this sampling plan is difficult to render from a Bayesian perspective in order to make the CB approach operative. Let \(N_D\) be the population count of interest for subpopulation (domain) \(P_D\) and let \(D_{hailj}\) be an indicator variable that takes the value 1 if unit j of household l in municipality i of enumeration area a of stratum h of the population belongs to \(P_D\) and 0 otherwise. Then \(N_{DR}=\sum _{h}\sum _{a}\sum _{i}\sum _{l}\sum _{j}D_{hailj}\) is the number of people in BRI for domain \(P_D\). The count estimates of the living population \(N_D\) can be obtained as

$$\begin{aligned} \hat{N}_D=\sum _h\sum _a\sum _i\sum _l\sum _j D_{hailj} \frac{1-\hat{p}^o_{hilj}}{1-\hat{p}^u_{hailj}}, \end{aligned}$$

where \(\hat{p}^o_{hilj}\) and \(\hat{p}^u_{hailj}\) are the estimated over- and under-coverage probabilities for each unit, computed from \(s^L\) and \(s^A\), respectively (see, for a similar approach, Pfeffermann 2015). A Bayesian treatment of the quantities \(\hat{p}^o_{hilj}\) and \(\hat{p}^u_{hailj}\) would readily yield a posterior distribution for the overall quantity \(N_D\) and, in turn, a suitable measure of uncertainty. As noted before, samples in the two surveys are in general negatively coordinated, although in practice we consider them independent. Indeed, \(\hat{p}^o_{hilj}\) and \(\hat{p}^u_{hailj}\) are estimated for socio-demographic profiles for which the probabilities of over-coverage can be considered homogeneous and where the assumptions of the capture/recapture model hold (see, for more details, Righi et al. 2021).
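To illustrate how posterior draws of the coverage probabilities propagate to the population count, the following Python sketch evaluates the weighted register count once per draw. It is a minimal sketch under stated assumptions: the function name `count_draws` is hypothetical, the Beta draws are placeholders standing in for the output of fitted coverage models, and the two sets of draws are generated independently, in line with the working assumption of independence between the two surveys.

```python
import numpy as np

def count_draws(in_domain, p_over_draws, p_under_draws):
    """Posterior draws of the domain count N_D.

    in_domain     : (n_units,) 0/1 domain indicator for register units
    p_over_draws  : (n_draws, n_units) draws of over-coverage probabilities
    p_under_draws : (n_draws, n_units) draws of under-coverage probabilities
    """
    weights = (1.0 - p_over_draws) / (1.0 - p_under_draws)
    return weights @ in_domain

rng = np.random.default_rng(1)
n_units, n_draws = 1000, 2000
in_domain = rng.integers(0, 2, size=n_units)
# Placeholder draws, NOT real posterior output
p_over = rng.beta(2, 60, size=(n_draws, n_units))
p_under = rng.beta(2, 80, size=(n_draws, n_units))

draws = count_draws(in_domain, p_over, p_under)
lo, hi = np.quantile(draws, [0.025, 0.975])
print(in_domain.sum(), draws.mean(), (lo, hi))
```

The quantiles of `draws` give a credible interval for the count directly, without any separate variance approximation.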

In more detail, let us first focus on the over-coverage probabilities estimated using a Bayesian logistic model on data from \(s^L\). The latter is drawn with a two-stage sampling design and, following Little (2006, 2022), the covariates that determine the sampling design must be included in the model to render the inclusion mechanism ignorable. In addition, a Bayesian hierarchical model should be used to deal with the within-cluster correlation.

Let \(y_{hilj}\) be the dichotomous random variable that is 1 when unit j of household l in municipality i of stratum h selected in \(s^L\) from BRI is not found for the interview and 0 otherwise. When \(y_{hilj}=1\), the unit in the register should not be counted in the population (over-coverage). Then, a possible hierarchical model for over-coverage can be written as follows

$$\begin{aligned} y_{hilj}|p_{hilj}^o&{\mathop {\sim }\limits ^{ind}}&{\textbf {Bernoulli}}(p_{hilj}^o)\nonumber \\ {\textbf {logit}}(p_{hilj}^o)= & {} \textbf{x}_{hilj}^T\varvec{\beta } + u_{hil} + v_{hi} + \gamma _{h} \end{aligned}$$
$$\begin{aligned} u_{hil}&{\mathop {\sim }\limits ^{iid}}&N(0;\sigma ^2_u) \end{aligned}$$
$$\begin{aligned} v_{hi}&{\mathop {\sim }\limits ^{iid}}&N(0;\sigma ^2_v) \\ p(\varvec{\beta }, \gamma _{h}, \sigma ^2_u, \sigma ^2_v)\propto & {} 1 \nonumber \end{aligned}$$


where

  • \(\textbf{x}_{hilj}\) collects individual-level covariates (such as gender, age class, and citizenship), household-level covariates (such as type of household or number of members), municipality-level covariates (such as population size and urban/non-urban type), and stratum-level covariates (such as macro-region);

  • \(u_{hil}\) is a household-level random effect;

  • \(v_{hi}\) is a municipality-level random effect;

  • \(\gamma _{h}\) is a stratum fixed effect.
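The structure of the linear predictor is perhaps easiest to see by simulating from the model. The Python sketch below generates over-coverage indicators for a toy register; all sizes, parameter values, and the function name `simulate_overcoverage` are hypothetical and chosen only for illustration. Households are nested in municipalities and municipalities in strata through integer index arrays.

```python
import numpy as np

def simulate_overcoverage(X, hh, mun, stratum, beta, gamma, sigma_u, sigma_v, rng):
    """Draw y from the hierarchical logistic model.

    X is the (n, p) covariate matrix; hh, mun and stratum are integer
    arrays giving each unit's household, municipality and stratum.
    """
    u = rng.normal(0.0, sigma_u, size=hh.max() + 1)    # household effects
    v = rng.normal(0.0, sigma_v, size=mun.max() + 1)   # municipality effects
    eta = X @ beta + u[hh] + v[mun] + gamma[stratum]   # logit(p)
    p = 1.0 / (1.0 + np.exp(-eta))
    return rng.binomial(1, p), p

rng = np.random.default_rng(2)
# Toy register: 200 households of 3 people, 20 municipalities, 4 strata
hh = np.repeat(np.arange(200), 3)
mun = np.repeat(np.arange(20), 10)[hh]
stratum = np.repeat(np.arange(4), 5)[mun]
X = np.column_stack([np.ones(hh.size), rng.integers(0, 2, hh.size)])

y, p = simulate_overcoverage(
    X, hh, mun, stratum,
    beta=np.array([-3.0, 0.5]),             # hypothetical values
    gamma=np.array([0.0, 0.2, -0.1, 0.3]),  # hypothetical values
    sigma_u=0.8, sigma_v=0.4, rng=rng,
)
print(y.mean(), p.min(), p.max())
```

Because people in the same household share \(u_{hil}\) and households in the same municipality share \(v_{hi}\), the simulated outcomes are correlated within clusters, which is the feature the random effects are meant to capture.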

Only variables included in the BRI can be in \(\varvec{x}\) since the model will be used to make a prediction of the variable y on the units in BRI not observed in the sample. In addition, the vector of first-order inclusion probabilities could also be included in \(\varvec{x}\) to account for extra-variability introduced by the complex survey design not explained by the design variables already introduced in the model.

For the regression parameters \(\varvec{\beta }\) and \(\gamma _{h}\), diffuse normal priors can be considered as an alternative to flat priors over the real line: they are sufficiently non-informative and computationally more convenient. The normality of the random effects is a standard assumption in hierarchical models, while the choice of the prior for the variance components has been widely debated, since in Bayesian mixed models the posterior distributions of these parameters are known to be sensitive to prior specification (Gelman 2006). Alternative choices include the inverse Gamma for the variance or the half-Cauchy for the standard deviation.
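As an illustration of one such specification, the Python sketch below evaluates the joint log-prior density with independent diffuse normal priors on the regression coefficients and half-Cauchy priors on the two standard deviations. The function name and the hyperparameter defaults (`coef_sd=10`, `scale=1`) are assumptions made for the example, not values recommended in the text.

```python
import numpy as np

def log_prior(beta, gamma, sigma_u, sigma_v, coef_sd=10.0, scale=1.0):
    """Log-density of N(0, coef_sd^2) priors on the coefficients and
    half-Cauchy(scale) priors on the standard deviations."""
    if sigma_u <= 0 or sigma_v <= 0:
        return -np.inf  # standard deviations must be positive
    coefs = np.concatenate([np.ravel(beta), np.ravel(gamma)])
    lp = np.sum(-0.5 * np.log(2 * np.pi * coef_sd**2)
                - 0.5 * (coefs / coef_sd) ** 2)
    for s in (sigma_u, sigma_v):
        # half-Cauchy density: 2 / (pi * scale * (1 + (s / scale)^2))
        lp += np.log(2.0) - np.log(np.pi * scale) - np.log1p((s / scale) ** 2)
    return lp

print(log_prior(beta=[-3.0, 0.5], gamma=[0.0, 0.2], sigma_u=0.8, sigma_v=0.4))
```

Evaluating the posterior under a few alternative values of `scale` is a simple way to carry out the prior sensitivity check that the variance components call for.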

The choice of the distribution for the household level random effect in Eq. 6 can be made more flexible by considering a different variance component for each possible household type, i.e.

$$\begin{aligned} u_{hil} {\mathop {\sim }\limits ^{iid}} N(0;\sigma ^2_{uk(l)}), \textrm{for}\ k=1,\ldots ,K. \end{aligned}$$

Here, k(l) denotes the group to which household l belongs. In fact, there might be different household types determined by their size, the relationship among members, and other characteristics: for instance, single-person households, couples, couples with children, and so on. The variance \(\sigma ^2_{uk}\) governs how similar the outcome variable is among people in households of the same type. Using different variance parameters \(\sigma ^2_{uk}\) for the \(u_{hil}\) would also allow removing some of the random effects when the posterior distributions of the corresponding variance parameters pile up in a neighborhood of zero.
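In code, the type-specific variance amounts to an index lookup: each household picks up the standard deviation of its own type. The Python sketch below uses hypothetical type codes and values; a type whose standard deviation is set to zero has its random effects switched off, which mimics dropping the effect for that household type.

```python
import numpy as np

def draw_household_effects(hh_type, sigma_u_by_type, rng):
    """Draw u_l ~ N(0, sigma_{u,k(l)}^2), where hh_type[l] = k(l)."""
    sd = np.asarray(sigma_u_by_type, dtype=float)[hh_type]
    return rng.normal(0.0, sd)

rng = np.random.default_rng(5)
# Hypothetical coding: 0 = single person, 1 = couple, 2 = couple with children
hh_type = np.array([0, 1, 2, 1, 0, 2, 1])
sigma_u_by_type = [0.0, 0.4, 0.9]   # type 0 effectively has no random effect
u = draw_household_effects(hh_type, sigma_u_by_type, rng)
print(u)
```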

The municipality-level random effect in Eq. 7 can be further generalized by allowing for an interaction with a subset of the covariates in \(\varvec{x}\), say \(\varvec{z}\) of dimension q. Then, the equation for the linear predictor in Eq. 5 can be extended to

$$\begin{aligned} {\textbf {logit}}(p_{hilj}^o) = \textbf{x}_{hilj}^T\varvec{\beta } + u_{hil} + \textbf{z}_{hilj}^T\varvec{v}_{hi} + \gamma _{h} \end{aligned}$$

where \(\varvec{v}_{hi} {\mathop {\sim }\limits ^{iid}} N_q(\varvec{0};\varvec{\Sigma }_v)\) and \(p(\varvec{\Sigma }_v)\propto 1\). An alternative choice for the prior could be a Wishart distribution for \(\varvec{\Sigma }_v^{-1}\). Variables in \(\varvec{z}\) could reflect the information used in the allocation of the sample of SSUs and/or directly the sample size. Similarly, the stratum-level fixed effect \(\gamma _h\) could be further generalized by including the interaction with a subset of the vector \(\varvec{x}\).
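The extended linear predictor can be sketched in Python as follows. All sizes, parameter values, and the function name `linear_predictor` are hypothetical. Each municipality receives a q-dimensional vector drawn from \(N_q(\varvec{0};\varvec{\Sigma }_v)\), and the term \(\textbf{z}^T\varvec{v}_{hi}\) is computed row by row.

```python
import numpy as np

def linear_predictor(X, Z, beta, u, V, gamma, hh, mun, stratum):
    """logit(p) = x'beta + u_hil + z'v_hi + gamma_h for every unit.

    X (n, p) and Z (n, q) are design matrices; V (M, q) stacks the
    municipality-specific vectors v_hi; hh, mun and stratum map each
    unit to its household, municipality and stratum.
    """
    random_slopes = np.einsum("nq,nq->n", Z, V[mun])  # row-wise z'v
    return X @ beta + u[hh] + random_slopes + gamma[stratum]

rng = np.random.default_rng(4)
n, M, H, q = 12, 3, 6, 2
hh = np.repeat(np.arange(H), 2)
mun = np.repeat(np.arange(M), 4)
stratum = np.array([0, 0, 1])[mun]
X = np.column_stack([np.ones(n), rng.integers(0, 2, n)])
Z = X[:, :q]                                       # z is a subset of x
Sigma_v = np.array([[0.30, 0.05], [0.05, 0.10]])   # hypothetical covariance
V = rng.multivariate_normal(np.zeros(q), Sigma_v, size=M)
u = rng.normal(0.0, 0.5, size=H)
beta = np.array([-3.0, 0.4])
gamma = np.array([0.0, 0.2])
eta = linear_predictor(X, Z, beta, u, V, gamma, hh, mun, stratum)
print(eta.round(2))
```

When \(\varvec{z}\) contains only a constant, the expression reduces to the random-intercept model of Eq. 5, so the simpler specification is a special case.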

A similar modeling exercise can be developed for under-coverage that makes use of the data from the area survey \(s^A\). In this case, the administrative geographical area characterizing the sampling design may introduce an intra-cluster correlation that should be taken into account. Let \(y_{hialj}\) be the dichotomous random variable that is 1 when unit j of household l in administrative area a of municipality i of stratum h selected in \(s^A\) is found for the interview and is not in BRI and 0 when the unit is found for the interview and is in BRI. When \(y_{hialj}=1\), the unit should be counted in the population (under-coverage). Then, a possible hierarchical model for under-coverage can be written as follows:

$$\begin{aligned} y_{hialj}|p_{hialj}^u&{\mathop {\sim }\limits ^{ind}}&{\textbf {Bernoulli}}(p_{hialj}^u) \nonumber \\ {\textbf {logit}}(p_{hialj}^u)= & {} \textbf{x}_{hialj}^T\varvec{\beta } + w_{hial} + u_{hia} + v_{hi} + \gamma _{h} \nonumber \\ w_{hial}&{\mathop {\sim }\limits ^{iid}}&N(0;\sigma ^2_w) \nonumber \\ u_{hia}&{\mathop {\sim }\limits ^{iid}}&N(0;\sigma ^2_u) \\ v_{hi}&{\mathop {\sim }\limits ^{iid}}&N(0;\sigma ^2_v) \nonumber \\ p(\varvec{\beta }, \gamma _{h}, \sigma ^2_w , \sigma ^2_u, \sigma ^2_v)\propto & {} 1 \nonumber \end{aligned}$$

where \(w_{hial}\) is a household-level random effect, \(u_{hia}\) is a random effect related to the administrative geographical area, and \(v_{hi}\) and \(\gamma _{h}\) have an interpretation similar to that of Eq. 5. Also in this case, it can be useful to consider different characteristics of the administrative geographical areas, such as rural/urban, type of dwelling, and use an approach similar to that used for households in Eq. 8 to model \(u_{hia}\). Alternatively, these random effects can be assumed to be spatially correlated according to the distance \(d_{aa'}\) between areas a and \(a'\). For example, Eq. 9 can be replaced by

$$\begin{aligned} \varvec{u} \sim N_A(\varvec{0};\varvec{\Sigma }_u), \quad \Sigma _{aa^\prime } = \sigma ^2_u \exp \left( -\phi d_{aa^\prime }\right) , \quad p(\phi ) \propto 1 \end{aligned}$$

where A is the number of areas, \(\sigma _u^2\) is the variance at any given point, and \(\phi \) is a smoothing parameter that controls the scale of the correlation between areas. A Conditional Autoregressive specification can also be considered in which the conditional distribution of \(u_{hia}\) given values in all the remaining areas only involves the neighboring areas.
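The exponential covariance structure can be built directly from the pairwise distances. In the Python sketch below, the area coordinates, the values of \(\sigma ^2_u\) and \(\phi \), and the function name `exp_covariance` are all hypothetical, and Euclidean distance between area centroids is assumed as the distance \(d_{aa'}\). A draw of the spatially correlated effects is then obtained through a Cholesky factor.

```python
import numpy as np

def exp_covariance(coords, sigma2_u, phi):
    """Sigma[a, a'] = sigma2_u * exp(-phi * d(a, a')), Euclidean d."""
    diff = coords[:, None, :] - coords[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=-1))
    return sigma2_u * np.exp(-phi * d)

rng = np.random.default_rng(3)
coords = rng.uniform(0, 10, size=(6, 2))     # hypothetical area centroids
Sigma_u = exp_covariance(coords, sigma2_u=0.5, phi=0.3)

# One draw of the area effects, u ~ N_A(0, Sigma_u)
L = np.linalg.cholesky(Sigma_u + 1e-10 * np.eye(6))  # small jitter for stability
u = L @ rng.standard_normal(6)
print(np.round(Sigma_u[0], 3), np.round(u, 3))
```

Larger values of \(\phi \) make the correlation decay faster with distance; as \(\phi \) grows, \(\varvec{\Sigma }_u\) approaches \(\sigma ^2_u\) times the identity and the areas become effectively independent.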

6 Conclusions

Finite population sampling is an important chapter of statistical theory that deserves particular attention and a specific methodology. Bayesian inference is based on a solid prescriptive and coherent mathematical theory, sometimes difficult to combine with the practical difficulties of survey sampling. Basu himself noticed, as reported in Zacks (2002):

The Bayesian as a surveyor must make all kinds of compromises... He may even agree to introduce an element of randomization into his plan... I can not put this enormous speculative process into a jacket of a theory. I happen to believe that data analysis is more than a scientific method...

The same concept is reiterated in Basu (1978):

I do not think that it is realistic to ask for a well-defined theory of survey sampling. The problem is too complex and too varied from case to case. I have no clear-cut prescription for the planning of a survey. Apart from saying that we ought to hold the data as fixed and speculate about the parameters I have indeed very little else to offer.

However, we believe that the Bayesian contribution to the development of a more efficient quantification of uncertainty in survey sampling can be valuable. In particular, the role of the prior distribution is crucial.

In the absence of genuine prior information, or when some sort of “objectivity” of the estimation process in the field of official statistics is required, the use of formal noninformative priors must be recommended in order to provide “calibrated answers” with good frequentist properties (Berger et al. 2022). In complex designs, the formal noninformative prior is often too difficult to derive and approximations are necessary, as for example in Berger et al. (2020). However, approximations should not be confused with weakly informative priors, which could provide silly - and, even worse, prior-dependent - answers (Berger 2006); this should be absolutely avoided.

In a completely different scenario, the use of available genuine prior information can be crucial and sometimes necessary. There are many cases where population parameters vary smoothly in time and space, as in demography, and it is relatively easy to guess a priori a reasonable range for such quantities. Introducing such information into the model would also ease the calibration of the simulation algorithm. Of course, a sensitivity analysis with respect to the prior inputs would be unavoidable in these cases; a significant dependence of the final answer on the prior inputs, however, should not be interpreted as a failure of the Bayesian approach but, rather, as an indication that there might be too many parameters in the model and that the information in the data is simply not enough to update all of them.

Finally, the last decades have experienced a real explosion, both in theoretical and applied terms, of Bayesian nonparametric methods of inference. Survey sampling has not yet been hit by this wave, although the seminal papers by Lo (1986, 1988) seem to have paved the way. Some recent exceptions are Mendoza et al. (2021) and Savitsky and Toth (2016) and, in the context of multiple imputation, Paddock (2002).

To better reiterate our consideration of Basu’s work, we would like to conclude with another quotation, taken from Casella and Gopal (2011):

(Re-)Reading Basu’s papers, which combine an inimitable style of writing with impactful examples, is an educating, enlightening and entertaining experience. At best, we question our assumptions and beliefs, which leads us to gain new insights into classical statistical concepts. At “worst”, we embark on a journey to becoming Bayesian.