Generalised calibration with latent variables for the treatment of unit nonresponse in sample surveys

Sample surveys may suffer from nonignorable unit nonresponse. This happens when the decision of whether or not to participate in the survey is correlated with variables of interest; in such a case, nonresponse produces biased estimates for parameters related to those variables, even after adjustments that account for auxiliary information. This paper presents a method to deal with nonignorable unit nonresponse that uses generalised calibration and latent variable modelling. Generalised calibration enables to model unit nonresponse using a set of auxiliary variables (instrumental or model variables), that can be different from those used in the calibration constraints (calibration variables). We propose to use latent variables to estimate the probability to participate in the survey and to construct a reweighting system incorporating such latent variables. The proposed methodology is illustrated, its properties discussed and tested on two simulation studies. Finally, it is applied to adjust estimates of the finite population mean wealth from the Italian Survey of Household Income and Wealth.


Introduction
Unit nonresponse occurs when sample selected units do not participate in a survey. When the decision whether or not to participate depends on one or more variables of interest, the missing data are said to be missing not at random. It is also referred to as nonignorable nonresponse because the missing data cannot be ignored when estimating finite population totals or means of survey variables in order to get consistent estimators (Deville 2000;Kott and Chang 2010).
In order to adjust for nonresponse, it would be essential to have access to granular auxiliary information on nonrespondents. Unfortunately, this is rarely the case. Generally, auxiliary information takes the form of a set of auxiliary variables whose value is known for the respondents and whose population totals are known or can be unbiasedly estimated. In this context, calibration weighting methods are often used to include auxiliary information at the estimation stage (Deville and Särndal 1992;Särndal 2007) and have been shown to be very useful in treating unit nonresponse as well (see e.g. Särndal and Lundström 2005). Their goal is the construction of a single set of weights to be used for all variables of interest modifying specified initial weights (usually the basic design weights), while satisfying benchmark constraints on known auxiliary information. No explicit model is specified for the treatment of the nonresponse mechanism; it is implicitly given by the calibration procedure.
Extending the calibration procedure, instrument vector method or generalised calibration allows distinction to be drawn among auxiliary variables between those that are useful to model unit nonresponse (instrumental or model variables) and those that are used in the calibration constraints (calibration variables). See Kott (2006), Chang and Kott (2008), and see Kott (2014) for a review on calibration weighting when model and calibration variables differ. It is important to note that in generalised calibration model variables need only be known for the respondents. This offers a particularly useful tool when the response probability of the units depends on one or more survey variables of interest and/or on other variables that we observe only on the respondents. Within this framework, the only proposal available in the literature to address unit nonresponse consists in using directly the variable of interest as an instrument in generalised calibration. Lesage et al. (2019) discuss the properties of the final estimator and the conditions under which it is successful in reducing nonresponse bias. This approach has at least two main pitfalls. First, nonresponse is likely to depend not just upon the variable of interest. This is particularly the case with multipurpose surveys. A second pitfall is due to the fact that observed variables may be affected by measurement error. Failing to consider such an issue may result in biased estimators.
In this paper, we propose to use latent variables as instruments in generalised calibration for the treatment of nonresponse. We believe that response to a survey is the outcome of a complex process that involves several aspects which cannot be captured by a single variable. Latent variable models can be used to extract either continuous constructs (latent trait models) or categorical ones (latent class models) from a set of manifest variables collected from respondents. Latent variable modelling has already been studied in the context of unit nonrepsonse: Matei and Ranalli (2015) discuss a setting in which a latent variable to be used as a covariate in a classical logistic model for response indicators is estimated from the set of response indicators in the presence of item nonresponse. Here, we present a more general framework to use latent variables in the context of calibration. In addition, we are able to test our approach on a real survey, the Survey of Household Income and Wealth conducted by the National Central Bank of Italy (see for instance Banca d'Italia 2018). This application is relevant since the survey collects very sensitive information and it is affected, like other similar surveys, by nonignorable unit nonresponse. In particular, rich households are known to be difficult to enroll in a sample survey. We use the 2008 wave for which we have also auxiliary information that enables us to enhance the performance of our estimator. To the best of our knowledge, this is the first application to real survey data of generalised calibration to deal with the issue of nonignorable nonresponse.
Notation and a review of generalised calibration as a treatment of unit nonresponse is provided in Sect. 2. A brief overview of the latent variable models used in the paper is provided in Sect. 3. The proposed methodology is introduced in Sect. 4. We test the methodology using two simulation studies (Sect. 5) and, finally, apply it to adjust estimates from the Italian Survey of Household Income and Wealth (Sect. 6). Some concluding remarks are provided in Sect. 7.

Calibration and generalised calibration for unit nonresponse
We are interested in estimating finite population quantities of interest (like totals, means and functions thereof). The finite population is considered fixed and the source of randomness (other than the nonresponse mechanism) is the sampling design. In this sense, our approach is mainly design based. Below we introduce the framework of this approach.

Framework
Let U ¼ f1; . . .; k; . . .; N g be the set of labels identifying the units of a finite population of interest and let y k ¼ ðy 1k ; . . .; y mk ; . . .; y Mk Þ be the value taken on unit k 2 U by an M-vector of variables of interest y. We are interested in estimating its population total t y ¼ P U y k . We will use the shorthand P A for P k2A , with A U an arbitrary set. To this end, a sample s of fixed size n is selected from U using a sampling design p(s) with first and second order inclusion probabilities p k ¼ Pðk 2 sÞ and p kl ¼ Pðk; l 2 sÞ, respectively, for k and l 2 U : Let I k ¼ 1 be the indicator variable for unit k selected in the sample, so that PðI k ¼ 1Þ ¼ p k ¼ EfI k jF g, where F ¼ fy 1 ; . . .; y N g is the finite population. Let d k ¼ p À1 k and d kl ¼ p À1 kl ; for k and l 2 U : Assuming p k [ 0; for all k 2 U , the Horvitz-Thompson estimator (Horvitz and Thompson 1952) We denote by r s the set of survey respondents and let the response indicator R k ¼ 1 if unit k 2 r and 0 otherwise. Thus r ¼ fk 2 sjR k ¼ 1g and the response mechanism is given by the probability distribution q(r|s). The random indicator variables R k are assumed independent of one another and of the sample selection mechanism. Then, the probability of responding of unit k to the survey is given by This is the two-phase approach where the sampling design is the first phase and the response mechanism is the second phase; it is based on the quasi-randomisation model of Oh and Scheuren (1983) (see also Särndal et al. 1992, Chapter 9). If the p k 's were known, the double expansion estimator would be unbiased for t y . In practice, a two-step procedure is applied in which the response probabilities p k are estimated assuming a response model and using data from the sample, and then applied to modify the basic design weight, d k ; k 2 s. The prefix "quasi" is added to emphasise that inference depends not only on the design, but also on the assumed response model.

Calibration
Let x be a set of auxiliary variables whose value x k ¼ ðx 1k ; . . .; x qk Þ is known for k 2 r and whose population total t x is known or can be unbiasedly estimated with the Horvitz-Thompson estimator b t HT x ¼ P s d k x k . Let t H x denote the population total of x or its Horvitz-Thompson estimator accordingly. As opposed to the two-step approach, calibration pursues the construction of a set of weights w c k to be used for all variables of interest using a single-step approach. In particular, it modifies specified initial weights (usually the d k 's), while satisfying benchmark constraints on known auxiliary information, i.e. X No explicit model is specified for the treatment of the nonresponse mechanism; it is implicitly given by the calibration procedure. In fact, in the calibration approach the initial weights d k are adjusted by a factor Fðx k c c Þ that depends on the auxiliary information and using a differentiable known function FðÁÞ. The final set of weights is given by where c c is estimated using the constraints given in (1). In particular, b c c is the value of c c for which X r d k Fðx k c c Þx k À t H x ¼ 0: Newton-Raphson methods can be used to this end. See e.g. Kott (2006, Sect. 4). Then, the calibration estimator of t y is Fuller et al. (1994) and Särndal and Lundström (2005) consider "linear" calibration, i.e. Fðx k c c Þ ¼ 1 þ x k c c , while Folsom (1991), Folsom and Singh (2000) and Kott (2006) consider "nonlinear" calibration, in which FðÁÞ is a nonlinear function as, e.g., for raking ratio Fðx k c c Þ ¼ expðx k c c Þ: The function FðÁÞ plays an important role in calibration for nonresponse. Different specifications of the function FðÁÞ have been suggested since the introduction of calibration in the paper by Deville and Särndal (1992). Note that, in the case in which nonresponse is not present, the effect on the final estimate of these alternative choices is limited. In fact, in that case b c c converges in probability to 0, when n ! 1; N ! 1, and Fð0Þ ¼ F 0 ð0Þ ¼ 1, where F 0 ðÁÞ is the first derivative of FðÁÞ, and FðuÞ % 1 þ u, i.e. all choices of FðÁÞ provide estimators that are asymptotically equivalent to the one based on linear calibration (Deville and Särndal 1992). In the case of nonresponse this is not the case anymore, since b c c does not converge to 0 (see also Haziza and Lesage 2016).
The choice of FðÁÞ is very relevant in the presence of nonresponse also because it influences significantly the properties of the final estimator. In fact, the calibration weight adjustment implicitly estimates the inverse of the probability of response, as p k is implicitly assumed to be the inverse of FðÁÞ. Calibration provides double protection against nonresponse bias in the sense of Kim and Park (2006). That is, if either a linear prediction model or the implicit selection model holds, then the resulting estimator has a negligible bias. Fuller et al. (1994) noted this property for the case of linear calibration: the estimator is nearly unbiased under the joint response-sampling mechanism that treats response as a second phase of random sampling and as long as the probability of response is such that p k ¼ ð1 þ x k c c Þ À1 . Kott (1994) introducing double protection in a survey setting (see also Haziza and Rao 2006) notes this equivalence for the regression estimator. Therefore, linear calibration can provide unrealistic values for such implicitly estimated probabilities, since weighting adjustments allow the implicit estimated selection probability to be less than 0 or greater than 1. For this reason, nonlinear calibration based on generalisations of raking is usually employed. See Kott and Liao (2012) for a discussion of the properties of alternative functions FðÁÞ. In addition, Haziza and Lesage (2016) provide a thorough discussion and simulation results of alternative choices of such functions when using calibration in nonresponse.

Generalised calibration
As discussed so far, no discrimination is made within the set of auxiliary variables available: a single set of variables is employed at the same time for nonresponse treatment, sampling error reduction and coherence among estimates. There are cases, however, in which one has a good reason to believe that nonresponse depends on a set of p model variables z, whose components need not coincide with the components of the calibration vector x. This can be accommodated in the calibration framework using "instrument vector" or "generalised" calibration introduced by Deville (2000) (see also Kott 2006) in which weights are given by where b c is an estimator of a set of parameters c such that, again, The model variables z are assumed to be related to the nonresponse propensity and are often called "instrumental variables". Note that we use this term following the survey sampling literature on the topic, rather than the econometric literature on sample selection models, under which x would be called instrumental variable, and model variables z are more closely related to those variables that satisfy "exclusion restrictions" (see Lesage et al. 2019, for more details on this topic).
A solution for c is most easily found when the number of model variables z (p) coincides with that of the calibration variables x (q). Then, given FðÁÞ and b c, the generalised calibration estimator of t y is b t GCAL y ¼ X r d k Fðz k b cÞy k : In particular, in the case of linear calibration, Then, the estimate of the total reduces to a sort of regression estimator such that b t GCALÀlin Chang and Kott (2008) and Kott and Liao (2017) for methods to find an estimate also when p\q, and Kott (2014) for a review on calibration weighting when model and calibration variables differ.
Note that the value of the model variables needs to be known only for k 2 r in order to compute the estimates b c and the final set of weights w k : For this reason, generalised calibration is particularly useful for nonresponse treatment, because unlike other reweighting methods, the problem can be addressed even when the variables that cause nonresponse are known only for the respondents. This is particularly relevant when nonresponse is nonignorable, i.e. when the topic of the survey and, therefore, the variables of interest and/or other variables influence the response probability of a unit, and we observe such variables only on the respondents. In fact, the possibility of introducing the variables of interest (known only for the respondents) as instrumental variables as a means of correcting this type of nonresponse has been investigated in the literature. See Deville (2000); Kott and Chang (2010). This aspect is being further addressed to investigate the properties of the generalised calibration estimator that uses the variable of interest in the instrument vector (Lesage et al. 2019).

Variance estimation for generalised calibration
Lett GCAL ym ¼ P r d k Fðz k b cÞy mk denote the estimator of the total t ym of a particular variable of interest y m and lett GCAL ym ¼ P r d k Fðz k cÞy mk . Following Kott (2006, p. 138), an estimator of the variance oft GCAL ym can be derived from the form of the asymptotic variance oft GCAL ym . Let Then, under the two-phase approach, the variance oft GCAL ym is the sum of two variance componentsthe sampling and the nonresponse variance component. That is, Then, a variance estimator oft GCAL ym is given byV Obtaining b V sam for a particular design may be cumbersome because one has to start from the respondents set. The "reverse" approach proposed by Shao and Steel (1999) considerably simplifies the task. In this framework, the finite population is first divided into two stratathe respondents and the nonrespondentsusing the response mechanism (first phase). Then, within the population of respondents, the sample is selected using the sampling design (second phase). Since we assume that the response mechanism is independent of the sampling process, the two terms can be exchanged in the decomposition of the variance. Therefore, the variance of a generic estimatort ym can be written as where R N ¼ ðR 1 ; . . .; R N Þ, and the conditional distribution given R N is the distribution over the sampling mechanism treating the response indicators R k ; k 2 U as fixed.
Following Kim and Riddles (2012, Sect. 4) we can estimate the variance oft GCAL ym with that of its linear approximation. This can be written as Since whereĝ H mk is such that c andã m are substituted with b c and b a m , respectively, in g H mk . Now, to estimate V 2 ; note that Note that the second term in the reverse approach is negligible compared to the first term if n/N is negligible. The two variance estimators considered above require computation of coefficients and residuals for each variable of interest. Since we are usually interested in estimating the total of a whole set of variables of interest, a replication based variance estimator will also be considered. In particular, a jackknife variance estimator where n r is the size of r, ðlÞ d k ¼ 0 when k ¼ l and ðlÞ d k ¼ n r d k =ðn r À 1Þ otherwise (see Kott 2006). Note that the jackknife variance estimator is an estimator of the first term of the reverse approach V 1 assuming with replacement sampling. In addition, b V jack ð b t GCAL ym Þ is consistent for V 1 even if the nonresponse model is incorrectly specified.

Latent variable models
Our proposal suggests to use latent variable models to obtain a suitable model variable to be used in generalised calibration, in surveys where such latent variables can be computed. Latent variable models are multivariate regression models that link continuous or categorical manifest response variables to unobserved latent variables (see Skrondal and Rabe-Hesketh 2007, for a survey). Several methods exist to determine such latent variables depending on the nature of the response variables. According to the nature of the latent phenomenon, we can distinguish between latent class and latent trait models. Latent classes refer to the categories of a discrete latent variable; such categories may, but need not, be ordered along a continuum. Latent traits, on the other hand, refer to a latent continuum which all individuals, based on their pattern of responses to a set of observed variables, are mapped on.
For example, we could use latent class models as a method of grouping survey respondents with respect to some underlying, unobservable variables, using responses to categorical manifest variables, such as "Do you have enough money to keep your home in a decent state of repair?", with responses coded as either "have or don't want or need" or "can't afford", included into a survey on below average income households. Alternatively, we could use continuous latent variables which are unmeasured variables that are believed to influence responses to a survey and are related to the variable of interest or the survey topic, such as "willing to respond to the survey" or "reluctance to provide confidential information". Other examples of applications of these ideas are discussed in Sect. 4.2. We now briefly present below the two types of latent variable models considered in this paper: latent class models and latent trait models.

Latent class models
In latent class models, the latent variable is supposed to be discrete (Lazarsfeld and Henry 1968;Goodman 1974). Latent class models allow manifest variables to be polytomous (ordered or unordered). Chapter 6 in Bartholomew et al. (2011) gives an overview of latent class models.
Let the latent class variable be denoted by # k , a particular latent class by c and the number of latent classes by C. Let the vector of manifest variables observed for k 2 r be denoted by . . .; h L Þ refers to a possible response pattern. Components of the manifest vector may include components of the response variable vector y k and of the model vector z k . Then the latent class model can be expressed as in which the probability of observing a response pattern h is a weighted average of class-specific probabilities. In fact, the term Pðx k ¼ hj# k ¼ cÞ is the conditional response probability of observing pattern h given that unit k belongs to class c, and the weight is the probability that such unit belongs to the latent class c.
Conditional to the latent variable, the observed variables are assumed to be independent within latent classes, and the conditional probability of unit k takes the form The model can be extended to allow for the presence of covariates that influence the probabilities Pð# k ¼ cÞ.
The parameters of a latent class model are usually estimated by means of the EM algorithm. The number of latent classes may be chosen using model selection criteria like AIC, BIC or cAIC. The final step in a traditional latent class analysis is to use the estimated parameters of the model to classify cases into the appropriate latent classes. The classification problem of assigning respondents to latent classes may be approached from a Bayesian point of view. In fact, posterior probabilities Pð# k ¼ cjx k ¼ hÞ can be estimated and the final estimate of # k is given by the class with the largest posterior probability given an observed response pattern.
Note that in the context of treatment of unit nonresponse in sample surveys, classification of units in latent classes provides an alternative way of building response homogeneity groups. In this case our latent variable h k ¼ ðh 1k ; . . .; h Ck Þ will be an indicator variable vector such that h ck ¼ 1 if # k ¼ c, and 0, otherwise (see the application in Sect. 6). When classes can be ordered, # k can also be used as is as a latent variable (see the first simulation study in Sect. 5.1).

Latent trait models
For ease of notation, we consider L binary manifest variables and let x k be, again, the vector of these manifest variables observed for k 2 r. Denote by . . .; h jk ; . . .; h Jk Þ is the value taken on unit k by the vector of J \L continuous latent variables. A latent trait model (see e.g. Bartholomew et al. 2002) is defined as where, for model identifiability, the location and scale parameters of h k are set to prespecified values, usually 0 and 1, respectively. Model (9) is also referred to as a two parameter logistic Rasch model, and is essentially a logistic regression except that the h k 's are not observed. Covariates may be introduced in the process of estimating the latent variable h k : Extension to ordered categorical manifest variables can be accommodated using Polytomous Rasch Models and Partial Credit Models (Fischer and Molenaar 1995). When J ¼ 1, it is assumed that the L manifest variables measure a single latent variable. The probability of success q k' is assumed to be a monotonic increasing function of the latent trait h k (monotonicity assumption). In addition, under the assumption of conditional independence, the values of the L manifest variables are assumed to be independent for each k given h k . Different goodness-of-fit measures of Model (9) are available in the literature to check whether these assumptions hold (see e.g. Bartholomew et al. 2002). Note that, however, these assumptions are particularly relevant when the objective of the study is the estimation and interpretation of the model parameters. In our perspective, the main objective is prediction and the ability to obtain a good reduction of the information from the manifest variables, to be used as members of the instrumental vector. The parameters of Model (9) can be estimated using different methods: maximum likelihood, generalised least squares, EM algorithm or Markov Chain Monte Carlo. The book of Bartholomew et al. (2011) provides a good description of these methods (Sect. 4.5 for an EM algorithm, Sect. 4.8 for the generalised least squares, and Sect. 4.11 for Markov Chain Monte Carlo methods).

The proposed methodology
Once an estimate of the latent variable is obtained, either using latent class or latent trait models, it can be used in a generalised calibration framework as the instrumental variable z or as a component of the vector of instrumental variables. The choice between latent trait models over latent class models is essentially dictated by the application at hand. Other types of latent models may also be used, according to the type of manifest variables we have available.
In this section we wish to discuss the properties of b t GCAL y ¼ P r w k y k , when a latent variable is used in the construction of w k : First note that, since the components of z k come from the survey and may include information on the survey variables of interest, properties that involve a superpopulation linear model between a particular survey variable, say y m , and the auxiliary variables x are more problematic to establish, rather than in the simpler calibration case (see Kott 2006;Kott and Chang 2010). Then, we will focus on the response model. In a multipurpose setting, in addition, this is a sensible choice also because a good model for one survey variable may not be as good for another survey variable.
A necessary condition, therefore, under which b t GCAL y is a consistent estimator for t y , is that p k ¼ Fðz k cÞ À1 . This condition has two main ingredients: the function FðÁÞ, that we have discussed already, and the choice of z. On this regard, Lesage et al. (2019) show that in order to achieve consistency, it is also required that the response mechanism is independent of x given z and, similarly, the response mechanism is independent of each element of y given z. In other words, the instrumental variables should extract all the relevant information that links the response process with x and with the response variables. Note that, in practice, it is usually not possible to validate the choice of FðÁÞ and/or the choice of the model variables in z, and their relationship with the survey and the auxiliary variables. In fact, model variables are only available for the respondents. In our opinion, external data on the nonrespondents and/or on the population should be available in order to validate such assumptions. See, for example, the application in Sect. 6.
As of variance estimation, we will consider the estimators proposed in the literature and reviewed in Sect. 2.4. We tested these estimators in the simulation studies in Sect. 5.

Motivation of using latent variables
There are several reasons why we find it is useful to introduce latent variables in generalised calibration. The first, and more obvious, is to reduce nonresponse bias compared to the classical calibration estimator, because it introduces information on the nonresponse behaviour when we believe that nonresponse depends on unobservable variables that can be estimated from a set of manifest variables using latent variable models. Of course this can be achieved also using generalised calibration using all the manifest variables as model variables. However, we will show in the simulation studies in Sect. 5, that using latent variables reduces considerably the variance of the generalised calibration estimator as opposed to including the variable(s) of interest. Note that this is true also when nonresponse depends on the variable of interest. In fact, it is known that in order to reduce the bias and the variance of the generalised calibration estimator, the model and the calibration variables should be highly correlated. Lesage et al. (2019) show that the generalised calibration estimator has a problem of variance amplification if the calibration variables and the model variables are not correlated, as this implies that the vector c is poorly estimated.
Using information from the survey variables of interest in the instrumental vector is shown to provide a trade-off between bias reduction and variance increase. The use of latent variables provides a very useful tool to achieve a compromise since we use an instrumental variable that is well correlated with both the variable of interest and the calibration variables. Note that calibration variables can be used in the estimation of the latent variables, by this increasing correlation and decreasing variance (see the first simulation study in Sect. 5.1).
In addition, depending on the context, a latent variable may be included as single instrumental variable, or together with other instrumental variables in generalised calibration. Consider the following examples. In attitude and behavioural surveys, nonresponse may be related to the survey topic itself (see Groves et al. 2006). It is the case of surveys on attitudes towards politics or abortion. In this context, a latent variable called "attitude towards politics" or "attitude towards abortion" can better explain the response probability of a unit to answer the survey questionnaire than any observed variable. Thus, it is possible to model the response probability as a function of this latent variable.
A similar example concerns a latent variable called "will to respond to the survey" which can be computed from item response indicators to the observed variables (e.g. Moustaki and Knott 2000;Matei and Ranalli 2015). The "will to respond to the survey" latent variable is well measured when nonresponse is nonignorable; e.g. in attitude surveys or in income surveys. Since this latent variable is a measure of the will to answer the survey, it can be assumed that response probabilities depend on it. As in the previous example, the "will to respond to the survey" latent variable may be included as an instrumental variable in the generalised calibration method.
On the other hand, most surveys are multipurpose. It is often difficult to make the assumption that nonresponse depends upon a single variable of interest. Again, the use of few latent variables which capture the behavior of variables of interest is useful because it reduces the dimensionality of the problem and allows for bias adjustment, without increasing the variance as much as when several variables of interest are used. In addition, note that the use of several variables of interest as model variables largely increases the variability of the final set of calibration weights. Finally, latent variables are a useful mean of dealing with the presence of measurement error in the manifest variables. Indeed, the latent variable approach is widely used to deal with such issues; see e.g. Biemer (2009).
Section 6 provides evidence with regard to both these issues using data from the Survey on Household Income and Wealth (SHIW). It is conducted by the National Central Bank of Italy every two years and collects detailed information on household income, wealth and consumption. In these instances, it is not sensible to assume that unit nonresponse depends only on a single surveyed item. In the SHIW survey, some households may refuse to participate because they don't want to report their income from self-employment. They may fear the consequences of disclosure of the answers to tax authorities. Others may not report all the houses they own because of social desirability bias or because they fear they could be robbed. Others may not be willing to participate because they don't want to report their financial wealth. Indeed, the survey collects information on two topics that are generally considered highly sensitive: income and household wealth. Therefore, it is reasonable to assume that each of these factors could contribute to nonresponse. A second issue that arises in the case of SHIW, is that of measurement error, since previous research has shown that both income and wealth are affected by underreporting (Neri and Zizza 2010;Neri and Ranalli 2011). Failing to consider such an issue may result in biased estimates. In this context, it is possible to create response homogeneity groups which are here in fact latent classes. The latter are constructed using survey responses as proxies of true wealth, and used in the generalised calibration method as instrumental variables.

Simulation studies
We present two simulation studies. The first one uses a categorical latent variable, while the second one a continuous one. In both studies, for a generic estimator b t y the following measures are computed: b t yi is the value of the estimator b t y at the i-th simulation run and nsim is the total number of simulation runs; • the Monte Carlo Variance • the Monte Carlo Mean Squared Error MSE ¼ B 2 þ VAR:

Simulation using a categorical latent variable
The first simulation study is based on the 'election' data available in the R package poLCA (Linzer and Lewis 2014). This data set reports information gathered during the 2000 American National Election Study public opinion poll. Following Linzer and Lewis (2011), three latent classes are constructed based on two sets of six items concerning a series of traits (moral, caringreally cares about people like you, knowledgable, strong leadership, dishonest, and intelligent) potentially describing the candidates Al Gore and George W. Bush: 'a reasonable theoretical approach might suppose that there are three latent classes of survey respondents: Gore supporters, Bush supporters, and those who are more or less neutral'. Each question has four possible choices: (1) extremely well; (2) quite well; (3) not too well; and (4) not well at all. The party respondent's identification (seven different parties, from strong Democratic to strong Republican, with Independents at 3-4-5 on the scale) is used as covariate x in the construction of the three latent classes. After deleting observations with missing values, the population size is 880. The model predicts that 26% of the units are Gore supporters, 35% of the units are Bush supporters, and 39% of the units are neutral. Eight observations considered as badly classified into the three classes have been removed from the population. Thus, N ¼ 872: The variable of interest y is the fourth item in this data set (the question is 'In your opinion, does the phrase he is knowledgeable describe Al Gore extremely well, quite well, not too well, or not well at all?').
We focus on estimation of the population total of people answering 'extremely well' or 'quite well' to the fourth item. The variable of interest y is, thus, dichotomised, taking value 1 if the answer is 'extremely well' or 'quite well' and 0 otherwise. The population total is t y ¼ 476: A number of 10,000 samples of size n ¼ 300 have been drawn from the population using simple random sample without replacement. From the selected samples, the respondents sets were generated using Poisson sampling with probabilities p k ¼ Fðz k cÞ À1 : The response probabilities p k were computed using four response models, respectively, where z is replaced by # or y: Model 1: p k ¼ 1=ð1:1 þ 0:9# k Þ; Model 2: p k ¼ 1= expð0:2 þ 0:5# k Þ; Model 3: p k ¼ 1=ð1:5 À 0:3y k Þ; Model 4: p k ¼ 1= expð0:4 À 0:2y k Þ: For each response model, the population average for p k is around 0.7. The population correlation coefficient between the auxiliary information x (here 'Party') and y is À0:56, while between x and # is 0.75; the correlation coefficient between y and # is À0:63: Corresponding to Models 1, 2, 3, and 4, there are respectively four link functions to use in generalised calibration: LinÀ# : Fð# k cÞ ¼ 1:1 þ 0:9# k ; RakÀ# : Fð# k cÞ ¼ expð0:2 þ 0:5# k Þ; LinÀy : Fðy k cÞ ¼ 1:5 À 0:3y k ; RakÀy : Fðy k cÞ ¼ expð0:4 À 0:2y k Þ: We used two different forms of the link function (linear and exponential) since the choice of FðÁÞ is important as underlined in Sect. 2. To test the performances of the different weighting systems, three estimators were computed: a generalised calibration, where z is the latent variable # (denoted below by GCALÀ#), a generalised calibration, where z is the variable of interest y (GCALÀy), and a simple calibration estimator, where z is the auxiliary variable x (CALÀx). Table 1 reports the simulation results considering the previous Monte Carlo measures. The (generalised) calibration weights have been computed using the appropriate link function in the R package sampling (Tillé and Matei 2021) and then applied to estimate t y : Finally, variance estimators have been considered as well. Note that in this particular setting in which we use simple random sampling without replacement to select the sample, and we use population level auxiliary information, we have found out that b V 2p and b V rev coincide. Appendix A provides a proof of this equivalence. We will denote simply by b V in Table 1 the result from these estimators. Model 1, Model 2, Model 3 and Model 4, corresponding to Lin À#; Rak À#; Lin Ày and Rak Ày, respectively are the true response models. Note that a true response corresponding to the true response models. In Table 1 the three estimators are given in the column 'Working model', below each FðÁÞ corresponding to a true response model. In Model 1 and Model 2, z is taken to be #, following thus theoretical approaches where political partisanship is strongly correlated with attitudes and behavior and thus can influence the probability to answer a political survey. In Model 3 and Model 4, z is the variable of interest y. The last two models represent a tool to test the robustness of GCALÀ#, since p k depends on y k , and not on # k : Following the results given in Table 1, we note that when the response model is based on # (Lin À# and Rak À#), GCAL-# has, as expected, a negligible bias and a smaller variance than GCAL-y. It also reduces the bias compared to CAL-x, since # is more correlated with the variable of interest y than x. On the other hand, as expected, its variance is larger than CAL-x. It also shows a larger mean square error than CAL-x, but of similar size. GCAL-# provides thus a good compromise: it shows smaller bias than the classical calibration estimator CAL-x and smaller variance than GCAL-y.
When GCALÀy is based on the true response model (Lin Ày and Rak Ày) its bias is negligible, but its variance is relatively larger than that of the other estimators, including the classical calibration estimator. The classical calibration estimator CALÀx has the opposite behavior, with a large bias and a small variance that places concern on the coverage rate. This behavior of the generalised calibration estimator versus the classical calibration one is also verified in other simulation studies given in the literature (see e.g. Osier 2013). When the response model is based on y (Lin Ày and Rak Ày), GCAL-# also provides the previous good compromise: it shows smaller bias than CAL-x and smaller variance than GCAL-y. But also its mean square error is smaller than the mean square errors of GCAL-y and GCAL-x. Table 1 also reports the performance of the variance estimators. For each variance estimator, its Monte Carlo expectation is computed, together with the empirical coverage rate of a 95% confidence interval. We note that b V has better behavior than b V jack in terms of Monte Carlo expectation. However, better coverage rates are provided by b V jack when the working model is misspecified. In fact, b V shows lower coverage rate when the response model depends on y and it is not the true one for CAL-x and GCAL-#; this is due to the presence of a larger bias displayed by the estimators CAL-x and GCAL-#. In some instances, the coverage rate of b V jack is above 95%. This may be due to a relatively high sampling fraction, 300=872 ¼ 34:4%. Since b V jack does not incorporate the finite population correction, the first term V 1 is overestimated and this compensates the underestimation of the second term V 2 of the reverse approach that is not negligible in this case.

Simulation using a continuous latent variable
We consider the 'Abortion' data set formed by four binary variables extracted from the 1986 British Social Attitudes Survey and concerning attitudes to abortion. N ¼ 379 individuals answered the following questions after being asked if the law should allow abortion under the circumstances presented under each item: 1. The woman decides on her own that she does not want to keep the baby. 2. The couple agrees that they do not wish to have a child. 3. The woman is not married and does not wish to marry the man. 4. The couple cannot afford any more children.
The data are analyzed in Bartholomew et al. (2002) and found to hide a latent continuous variable, interpretable as the "attitude to abortion". It is plausible in this example to assume that the response probability of a unit k to the survey depends on its attitude to abortion h k , rather than on a single variable of interest enumerated before. In our study, we will consider both cases.
We focus on estimation of the total of the second variable in this data set (y), having in the population t y ¼ 225: At the population level, the latent variable h k has been estimated for each unit using Model (9) and the R package ltm (Rizopoulos 2006). The method used for estimation is marginal maximum likelihood. An auxiliary variable x k has been generated as x k ¼ 1 þ P 4 m¼1 y mk þ k , with k $ N ð0; 1Þ. Its population total is assumed known. The simulation data frame consists of N ¼ 379 observations with binary answers the previous four items, associated latent variable h k and auxiliary information x k : The correlation coefficient between x and h, and between y and h is around 0.8 in the population, respectively.
We draw 10,000 simple random samples without replacement of size n ¼ 100 (similar results not shown here have been obtained using the same settings, and a sample size n ¼ 50). For each selected sample, nonresponse was simulated using Poisson sampling with probabilities defined for each unit according to two different models: Model 5: p k ¼ 1= expð0:1 þ 0:5h k Þ; Model 6: p k ¼ 1= expð0:1 þ 0:5y k Þ: The corresponding link functions in generalised calibration are respectively: RakÀh : Fðh k cÞ ¼ expð0:1 þ 0:5h k Þ; RakÀy : Fðy k cÞ ¼ expð0:1 þ 0:5y k Þ: In both models, the population average for p k is around 0.7. As in Sect. 5.2, three estimators were computed: a generalised calibration, where z is the latent variable h (denoted below by GCALÀh), a generalised calibration, where z is the variable of interest y (GCALÀy), and a simple calibration estimator, where z is the auxiliary variable x (CALÀx). In particular, by using GCALÀh; we wish to investigate the ability of the proposed approach to reduce bias when Model 5 is used to create the set of the respondents in the population, and its robustness in terms of bias and variance in the case when Model 6 is employed. In this example, only the exponential link function was used. Similar results (not shown here) have been obtained using the linear case. Table 2 shows the results for 'abortion' data. Estimators behave similarly to the first simulation study. The generalised calibration estimator GCALÀh has, as expected, a very good performance in the Rak Àh setting, but is also a very good compromise in the other case. We also note that when the response model depends on y (Rak Ày) the use of h k as model variable shows good behavior in terms of both bias and variance compared to GCALÀy and CALÀx: In particular, we can see, again, that the proposed approach is robust also if nonresponse depends only one variable of interest. In addition, it is always much less variable that the one that uses the variable of interest. Note that the estimator provided by our approach shows, in these settings, smaller values of the mean square error than those given by the calibration estimator that uses only classical auxiliary information. The variance estimators behave similarly to the the first simulation.

Application to the Italian survey on household income and wealth
The Survey on Household Income and Wealth (SHIW) is conducted by the National Central Bank of Italy every two years to study the economic behaviours of Italian households by collecting detailed information on their income and wealth. Our goal here is to make inferences about the average household net wealth for the Italian population in 2008. We use the 2008 wave since for this survey we have access to auxiliary information that can be used to better explore the properties of our estimator. The sample consists of about 8000 households selected from population registers using a complex two-stage (municipality-household) sampling design. The survey also collects paradata on the recruitment process of households, such as information on the call attempts and on the characteristics of the dwelling, such as exterior appearance and the overall assessment of the area where the building is Table 2 Second simulation study. Monte Carlo Bias, Variance and MSE for each estimator in each simulation setting. Monte Carlo mean and 95% confidence interval coverage for each variance estimator. Abortion data set, n ¼ 100; N ¼ 379 Rak Àĥ located. Because of the sensitiveness of the issues surveyed, measurement error and unit nonresponse are two major components of total survey error. On the contrary, item nonresponse is negligible because of the design of the survey (in case the respondent does not provide information on income or wealth, the overall interview is not accepted). Previous research based on SHIW data shows that nonresponse is nonignorable and depends on the true wealth (for more details see D' Alessio and Faiella 2002;Neri and Ranalli 2011;D'Alessio and Neri 2015). It should be noted that the true wealth is not observed either for respondents because of measurement error. Our approach uses survey responses related to wealth as proxies of true wealth and builds latent classes to be used as response homogeneity groups. In particular, we use the following variables to build the vector of manifest variables x k : the individual observed wealth class (ordinal variable with five levels), the number of total call attempts needed to make the interview (ranging between 1 and 4) and six dummy indicators for the ownership of a secondary dwelling, of bonds, of agricultural and of non-agricultural land, of other non-residential buildings and for the household living in a deluxe dwelling. Overall, the vector of manifest variables x is made of L ¼ 12 components.
Using latent class analysis, we classify respondents into five latent classes. We used the BIC measure to select the number of classes to use. The value of the BIC for the six class model was slightly smaller than the one for five. However, the sixth class was very small and not very different from another one, and so we decided to use a classification in five latent classes. The latent classes can be interpreted in terms of the household's economic well-being. The first class is formed by the very rich and it represents about 16% of the sample. It is mainly made of individuals with a high level of financial and non-financial wealth, living in a luxurious residence and who are difficult to contact or to interview. The second class can be named the welloff: it includes households with a medium-high wealth, living in a luxurious residence, with high financial wealth, and who are easy to contact or to interview (about 30% of the sample). The third class is made of average households (average wealth, owners of some land and non-residential buildings, easy to contact or to interviewabout 16% of the sample). The fourth class is composed of below average households (about 25% of the sample). They have low levels of wealth but they are easy to contact or to interview. Finally, the last class is made of the very poor. Households in this class have a low wealth, almost zero financial wealth and they are difficult to contact or to interview (about 13% of the sample). Note that the latent classes are not strictly ordered. This is another reason why we have decided to use a latent class model, instead of a latent trait model: in this application one latent trait was not fitting the data well, by this implying that these manifest variables hide a multidimensional structure.
The predicted latent class memberships h k are used as model variables z k . As calibration variables, we use two sources of auxiliary information. The first one is the Italian National Statistical Office and consists of the distribution of individuals according to some demographic variables: age (5 classes), gender, education (3 levels), nationality (italian/foreigner), job status (employed/unemployed/inactive), geographical area (north/centre/south). The second one is the Italian Department of the Treasury holding administrative records of real estate owners and consists of the distribution of individuals according to the value of the owned dwelling (5 classes). Overall, the vector of population totals t x is made of q ¼ 18 components.
Six different approaches to estimate the mean household wealth are compared: (1) the Hajek estimator P r d k y k = P r d k (without nonresponse adjustment); (2) a double expansion estimator in which response probabilities are estimated via a logistic model that uses covariates known also for nonrespondents (see details in Neri and Ranalli 2011); (3) the classical calibration estimator using the aforementioned 18 calibration variables; (4) the generalised calibration using the predicted latent class memberships b h k as model variables; (5) the generalised calibration using twelve manifest variables x k as model variables; (6) the generalised calibration using the individual observed wealth class (classes of y k ) as model variables.
For cases (4), (5) and (6), since p\q, we use a two-step generalised calibration approach to obtain final weights w k ¼ d k Fðz k b c 1 ÞFðx k b c 2 Þ: The first step is the adjustment step Fðz k b c 1 Þ obtained using the method proposed in Chang and Kott (2008). Since it does not provide calibrated weights, in the second step Fðx k b c 2 Þ is obtained calibrating on t x using d k Fðz k b c 1 Þ as starting weights. The inverse of the two step adjustment Fðz k b c 1 ÞFðx k b c 2 Þ provides a measure of the implicit response probability estimated with generalised calibration. Table 3 reports the mean value of 1=Fðz k b c 1 ÞFðx k b c 2 Þ computed for household wealth and income percentiles. As expected, the higher the household income or wealth, the lower the estimated response probability. Households in the lowest quintile of both distribution have a response propensity of about 0.8. This figure drops to about 0.4 in the upper tail of the distribution. The relation is almost monotonic as far as wealth is concerned. As to income, the drop in the response probability is mainly due to households with a median-high income.  Table 4 reports the results for all the approaches with an estimate of the standard error using jackknife. Nonresponse adjustments via calibration all provide an increase in the estimate. This is reasonable since we expect wealthier households to be less cooperative. It is worth noting that the true mean of the households' wealth is unknown for this application. In fact, there is a macro estimate of total household wealth produced the Bank of Italy using different sources that do not include the SHIW survey. For 2008 this estimate is about 8,500 billions of euro, i.e. 368,840 euros mean estimate. We do not use this information in calibration because of the differences in definitions between the two sources. Moreover, the macro estimate is constructed using also statistical models and assumptions. Nonetheless, it can provide some insights on the goodness of our approach. In Table 4 the ratio of the estimates to this value is also provided. It can be noted that all estimates based on generalised calibration get closer to the macro one.
The estimated variance of the generalised calibration estimators ð4Þ À ð6Þ is larger than that of classical calibration (3), because of the increased complexity in the first step nonresponse adjustment. As already noted in the simulations, generalised calibration that uses the variables of interest as model variables (cases (5) and (6)) provides more variable estimators than the one that uses latent variables. By comparing (5) with (4), it seems that the reduction in dimensionality performed by latent class analysis allows for a more stable estimator, without losing too much in terms of information. This is also true when comparing (4) and (6), that have the same number of model variables. Finally, note that (4) shows a far less variable set of weights as opposed to (5) and (6).

Conclusions
This paper proposes the use of latent variables in generalised calibration to handle nonignorable unit nonresponse in those surveys where such latent variables can be computed. The method is tested on a series of simulation studies and applied to adjust estimates from the Italian Survey on Household Income and Wealth. We find that the method presents some merits compared to the main current proposal available in the literature, which consists in using the variable of interest as an instrument in generalised calibration. This approach has at least two main pitfalls. First, nonresponse is likely to depend not just upon the variable of interest. This is particularly the case with multipurpose surveys. One may also try to add other observed variables, but this usually increases the complexity of the estimator and, therefore, the variability of the estimates and of the final set of weights. A second pitfall is that observed variables may be affected by measurement error. Failing to consider such an issue may result in biased estimators.
We considered in our examples only a single latent variable as instrumental variable. However, several latent variables can also be used, because the vector z is pdimensional. For example, in the continuous case, in addition to "will to respond to the survey" presented in Sect. 4.2, a second latent variable "reluctance to provide confidential information" can also be employed as instrumental variable, since they could explain more accurately dependence among the manifest variables. In the discrete case, more than one class variable can be used. We borrow from Jeon et al. (2017) an example where adolescent violent behavior and drug-using behaviors are investigated. For this example, nonresponse may be related to the survey topic. Three class variables are provided using data from 4,957 male high-school students who participated into a "Youth Risk Behavior Surveillance System" in 2015: violent behavior (with four classes), alcohol drinking (with three classes), tobacco cigarette smoking (with three classes), and other drug use (with three classes). These class variables can be introduced as instrumental variables in our approach. On the other hand, Jeon et al. (2017) propose a joint latent class analysis, which seeks to explain the joint patterns of two or more latent class variables. For the example provided, three joint classes are identified based on the previous class variables, reducing finally again the approach to the use of a single latent variable.
We find that our approach is a parsimonious method to relax the assumption that nonresponse depends on a single variable of interest. Moreover, the latent variable framework is widely used in the literature to deal with the issue of measurement error. Therefore it allows us to relax the assumption that manifest variables used as instruments are error-free. Our simulations show that the latent variable approach reduces the variance compared to the generalised calibration even when response probabilities depend on the variable of interest and the latter is used as an instrumental variable. It is also more efficient than classical calibration that uses only auxiliary information as model variables.
We have considered latent trait models, which provide a continuous measure of the response propensity, and latent class models, which provide a categorical measure thereof. The choice between the two is essentially dictated by the application at hand. However, some guidelines can be provided. In general, weights based on a small number of classes tend to be more stable than the weights obtained using an estimated propensity. Latent classes provide a model based alternative to the construction of post-strata that are, by construction, built using similar units. In addition, note that very often in surveys, even when a continuous response propensity is estimated, it is not used as is, but it is categorised in classes (see e.g. Haziza and Lesage 2016). We do believe that using latent classes provides a much better and data-driven classification of units, than subjectively setting thresholds of values for response propensities.
We also present an application of the method to survey data relating to household wealth and compare it with other methods. To the best of our knowledge, this is the first application to real data of generalised calibration with variables of interest used as instrumental variables. To avoid the issue of variance amplification as highlighted in Lesage et al. (2019), it is important to use auxiliary variables that are well correlated with the latent variables and the nonresponse behavior. In this way it is possible to estimate better the calibration parameters. In our application, this role has been played by auxiliary information coming from administrative records of real estate owners and consists of the distribution of individuals according to the value of the owned dwelling. More research is still needed on tools to select auxiliary and model variables in generalised calibration.
Of course, this approach suffers from bias when the assumed nonresponse model is misspecified in terms of the model variables or in terms of the link function. The robustness of the generalised calibration estimator in this case is still an open problem. We can never safely assume one thing or the other unless we have external independent data (on nonrespondents) that can help verify these. As an alternative, it is important to have external validation information that we can use as a sort of gold standard to evaluate results. In our application, we have used to this end a macro estimate of the total household wealth produced using different sources that do not include the SHIW survey. We find that the use of latent variables seems to provide a good compromise between bias reduction and variance increase, and helps to make generalised calibration more applicable.
We end the discussion with some cautionary considerations in the application of the proposed methodology. As we underlined, our proposal is suitable for estimation using data issued from surveys where it makes sense to compute latent variables (where the assumptions and usual measures of goodness-of-fit are fulfilled for such models). Sometimes, however, even in these surveys, possibly poor latent variables for generalised calibration can occur, showing a weak correlation with the calibration variables, and/or being not necessarily related to the nonresponse structure. We provide below some practical guidelines to decide between the use a (possibly poor) latent variable or the direct use of the manifest variables: • compute some measure of association and/or correlation between the latent variable(s) and the calibration variables; if their values are low, a simple calibration approach is suitable instead of generalised calibration, and the proposed method should be avoided; • if the measures of association and/or correlation are important, compute two estimates: using the latent variable(s) and using the manifest variables; • for each estimate, compute: the jackknife standard error estimator proposed in Sect. 2.4, the estimated coefficient of variation, and the standard deviation of the final weights. Finally, choose the method providing the best results, minimising these three measures. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/ licenses/by/4.0/.