1 Introduction

In recent years, online surveys have undergone rapid development in a wide variety of fields, including public opinion research (Couper 2000) and life sciences (Thornton et al. 2016; Borodovsky et al. 2018). In contrast to traditional survey modes, which are experiencing issues with response rates (according to Marken (2018), response rates in Gallup Poll Social Series dropped from 28% in 1997 to 7% in 2017) and increasing costs, online surveys offer a faster and cheaper method to measure certain features in individuals. In addition, there is an increasing availability of large sets of data obtained from the Web with automatic procedures (such as web scraping or APIs) that are often used for inference in finite populations.

Non-probabilistic surveys offer some advantages over traditional methods, but they also raise many research problems: some concern the velocity required for data processing and analysis, but the most important concerns the quality of the data, which matters far more than its quantity. Meng (2018) studies theoretical aspects of the impact of non-probability online surveys on estimation quality, develops a data defect index for population inferences from Big Data, and concludes that reducing sampling and non-response biases is more important than reducing non-response rates.

Indeed, non-probabilistic surveys accentuate certain types of nonsampling errors. It is not feasible to obtain a representative sampling frame of the online population except in specific situations where the target population is a well-characterised group (such as company employees or university students, each of whom is associated with an e-mail address). For this reason, most online surveys or large volume datasets are based on volunteer samples. In addition, the coverage of this approach is limited by the extent of Internet penetration among the population, which often varies with demographic characteristics. For instance, according to the Spanish Survey on Equipment and Use of Information and Communication Technologies in Households (National Institute of Statistics of Spain 2018), while 98.5% of the Spanish population aged 16–24 years make regular use of the Internet, only 49.1% of those aged 65–74 years do so. Although the difference has narrowed in the last few years, online surveys are still unable to provide representative samples except when special procedures are used, such as offline recruitment, panels or mixed modes (see Schonlau and Couper 2017 for a review of the available options).

The lack of a probability sampling scheme might lead to significant differences between sampled and nonsampled individuals, which constitutes a selection bias that cannot be redressed with the usual procedures (Elliott and Valliant 2017). Selection bias is a particularly important concern in online surveys because of their intrinsic characteristics (Couper 2000). Statistical adjustments are crucial to obtaining reliable estimates from online survey data; in this context, calibration or Propensity Score Adjustment (PSA) can be used, according to the kind of auxiliary information available. While calibration only needs the vector of population totals for some auxiliary covariates, PSA requires a probability sample drawn from the same target population, even when the nonprobability sample is drawn from a subset of it, which is the case of Internet surveys (not everybody may have access to the Internet in a given population) and imperfect sampling frames in general. This sample is used to estimate the (unknown) participation propensities for the individuals in the nonprobability sample through prediction models. These estimated propensities can be used as inclusion probabilities to build weights for different parametric estimators.

The efficacy of PSA at removing selection bias has been proved, although some considerations should be taken into account. First, PSA is strongly dependent on the covariates used to estimate the propensities. Lee (2006) showed that the use of covariates which are strongly related to the variables of interest in PSA models achieves greater reductions in bias than is the case with nonsignificant variables. Second, further adjustments such as calibration procedures must be applied in order to maximise the effectiveness of PSA (Lee and Valliant 2009; Valliant and Dever 2011; Valliant 2020). Finally, the use of PSA is associated with an increase in the variance of the estimates.

In this study, we focus on the first point raised above: the choice of covariates. Lee (2006) suggested that including all available covariates, as recommended by Rubin and Thomas (1996), might be a reasonable practice. However, statistical models based on modern classification techniques such as Machine Learning algorithms might benefit from feature selection to reduce the complexity of the models (and the variance of their predictions). Variable inclusion in propensity models for treatment weighting has been widely studied (Hirano and Imbens 2001; Brookhart et al. 2006; Austin 2008; Schneeweiss et al. 2009; Austin 2011; Myers et al. 2011; Patrick et al. 2011; Austin and Stuart 2015) and variables are often selected using a stepwise algorithm or they are assessed prior to the study according to their known relationship to the outcome or exposure variables. In this case, better results are obtained when the variables in question are related to the outcome variables or to both the outcome and the exposure variables.

In many real-world applications, there may be very little information about the pre-existing relationships between variables, which increases the difficulty of selecting the best subset of variables for propensity estimation. In the present study, we consider how modern techniques of feature selection (or variable selection) developed for knowledge discovery in data can be used in propensity estimation modelling. These techniques only require an appropriate dataset from which to identify the variables most closely related to a given target variable, or those most influential on the predicted values, according to the behaviour observed in the data. The benefits of feature selection, in terms of increased accuracy and reduced computational costs, have been demonstrated in classification tasks (Bolón-Canedo et al. 2013; Xue et al. 2015).

In survey research, feature selection has been studied with respect to the problem of calibration when a large number of variables must be considered. Breidt and Opsomer (2017) reviewed this question and suggested that auxiliary variables for calibration may be too closely correlated or have poor predictive power, and therefore model selection should be employed to improve the estimates obtained and to stabilise the weights. Stepwise and best subsets algorithms have been considered for this purpose, but models from the class of “least absolute shrinkage and selection operator” (LASSO), which perform feature selection by shrinking regression coefficients to zero in non-informative variables, seem to be the most promising methods to improve the weighting. Their efficiency in non-probability samples was highlighted by Chen et al. (2019), who showed that LASSO-weighted estimators have a lower RMSE than PSA-weighted equivalents.

The rest of this paper is organised as follows: Sect. 2 presents the essential aspects of calibration and PSA. The synthetic data and the real survey datasets used in our experiments are then described in Sect. 3. In Sect. 4 we describe the deployment of PSA models with a grid of classifiers and feature selection algorithms for the study data. The results of the experiments in terms of relative bias and efficiency are detailed in Sect. 5, after which the method proposed is applied in a real-world context concerning addiction and dependence, in Sect. 6. Finally, the implications of our findings are discussed in Sect. 7.

2 Adjustments for nonprobability samples

2.1 Calibration

Calibration was developed by Deville and Särndal (1992) as a reweighting method based on the availability of population totals for auxiliary variables measured in a sample, although some later versions addressed missing data situations or the use of dual frames for survey sampling (Ranalli et al. 2016). This adjustment is intended to reduce the coverage error between the target population and the sample, and takes the following form. Let \({\mathbf {x}}\) be an \(n \times p\) matrix of p auxiliary variables measured in a sample of size n, where \(x_{ij}\) is the value of the i-th individual in the j-th auxiliary variable, \({\mathbf {X}} = (X_1, ..., X_j, ..., X_p)\) are the known population totals for the auxiliary variables and \(d = (d_1, ..., d_i, ..., d_n)\) is the vector of design weights of the sample. If a probabilistic unbiased sample from the same population is available, estimated population totals can be used for \({\mathbf {X}}\) as an alternative (see Ferri-García and Rueda 2018 for a study of its efficiency). Calibration then attempts to obtain a new vector of weights \(w = (w_1, ..., w_i, ..., w_n)\) by minimising their distance to d (within a class of distances, each leading to a different estimator) subject to the calibration equations:

$$\begin{aligned} \sum _{k = 1}^n w_k x_{kj} = X_j, j = 1, ..., p \end{aligned}$$
(1)

When information on population totals is incomplete, and especially when the cross-classification totals (also known as cell counts) are not known, it can be useful to use the raking ratio as defined in Deville et al. (1993), which takes advantage of the estimation of cell counts from the available data in the sample. Here, let \({\hat{N}}_{ab} = \sum _{k / x_{Ak} = a, x_{Bk} = b} d_k\) be the estimated cell count of ab, which represents the number of individuals whose measured value in the variables A and B is a and b respectively. The raking ratio uses this information to reformulate the calibration equations, thus obtaining the calibrated weights \(w_k = d_k {\hat{N}}_{ab}^w / {\hat{N}}_{ab}\), where \({\hat{N}}_{ab}^w = d_k {\hat{N}}_{ab}\) represents the calibrated estimations of the cell counts. The efficiency of calibration procedures depends on the relevance of the auxiliary information in terms of relationship with the target variable and on the mechanism producing the coverage error. Calibration has also been found to be effective for removing selection bias when the target variable is not related to the selection mechanism (Bethlehem 2010; Rueda 2019).
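For illustration, the raking ratio can be computed with a simple iterative proportional fitting loop. The following R sketch is not the authors' code: the data frame samp, the design weight vector d and the population total vectors totA and totB are hypothetical names introduced only for this example, and the loop handles two categorical margins A and B.

```r
# Minimal raking (iterative proportional fitting) sketch for two categorical
# margins A and B. 'samp' has factor columns A and B, 'd' holds the design
# weights, and 'totA'/'totB' are named vectors of known population totals whose
# names match the factor levels (in level order).
rake_two_margins <- function(samp, d, totA, totB, max_iter = 100, tol = 1e-8) {
  w <- d
  for (it in seq_len(max_iter)) {
    fA <- totA / tapply(w, samp$A, sum)      # adjustment factors for margin A
    w  <- w * fA[as.character(samp$A)]
    fB <- totB / tapply(w, samp$B, sum)      # adjustment factors for margin B
    w  <- w * fB[as.character(samp$B)]
    if (max(abs(tapply(w, samp$A, sum) - totA),
            abs(tapply(w, samp$B, sum) - totB)) < tol) break
  }
  w                                          # calibrated (raked) weights
}
```

In practice, dedicated implementations (e.g. the calibrate and rake functions of the R survey package) would normally be used, as they cover other distance functions and weight bounds.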

2.2 Propensity Score Adjustment

Propensity Score Adjustment (PSA) was originally developed by Rosenbaum and Rubin (1983) as a technique for balancing comparison groups in nonrandomised studies, where the inclusion in one group or another might be driven by or associated with variables not controlled by the researchers. PSA was subsequently adapted to the context of online surveys (Taylor 2000; Taylor et al. 2001; Lee 2006; Castro-Martín et al. 2020a) as a means of reducing selection bias when a reference probability sample collected from the same target population is available. In this case, let \(s_r\) be the reference sample, \(s_v\) the nonprobability sample obtained from the online survey and \(s = s_r \cup s_v\). Furthermore, let R be a binary variable measured on the target population U, where \(R_i = 1\) if \(i \in s_v\) and \(R_i = 0\) otherwise. PSA assumes that the inclusion probability or propensity score, \(\pi \), for \(s_v\) is conditional on a set of covariates, \({\mathbf {x}}\), such that:

$$\begin{aligned} \pi _i = P(R_i = 1 | {\mathbf {x}}_i), \ \ \ i \in U \end{aligned}$$
(2)

The inclusion probability can therefore be modelled through a proxy of R. Let z be a binary variable measured for s such that \(z_i = 1\) if \(i \in s_v\) and \(z_i = 0\) if \(i \in s_r\). The propensity score is then estimated by predicting the values of z using a model M:

$$\begin{aligned} {\hat{\pi }}_i^* = E_M [z = 1 | {\mathbf {x}}_i], \ \ \ i \in s_v \cup s_r \end{aligned}$$
(3)

Note that in this case we are not estimating \(\pi \) but \(\pi ^*\), which is the propensity obtained when we predict the measured participation z rather than the true participation R.
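As a minimal illustration of Eq. (3), the propensities can be estimated with a logistic regression on the combined sample. In this R sketch, sv, sr and the covariate names in covs are hypothetical placeholders rather than objects from the original study.

```r
# Sketch of propensity estimation (Eq. 3) with logistic regression.
# 'sv' (volunteer sample) and 'sr' (reference sample) are data frames that
# share the covariate columns listed in 'covs' (hypothetical names).
covs <- c("x1", "x2", "x3")
s <- rbind(data.frame(sv[, covs, drop = FALSE], z = 1),
           data.frame(sr[, covs, drop = FALSE], z = 0))
fit <- glm(z ~ ., data = s, family = binomial)
pi_hat <- predict(fit, newdata = s, type = "response")  # estimated propensities on s_v U s_r
```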

The propensity scores are used to reweight the nonprobability sample. In this process, inverse probability weighting formulas can be used, such as the simple inverse probability \(w^{PSAIPW1} = 1/\pi \) (Valliant 2020) or the inverse probability allowing weights to be less than one, as proposed by Schonlau and Couper (2017): \(w^{PSAIPW2} = (1 - \pi )/\pi \). Propensities can also be transformed into weights using the subclassification methods proposed by Lee (2006) and Lee and Valliant (2009). This technique stratifies the vector of propensities into c parts (following Cochran (1968), c is usually taken as 5) with similar propensities, applying the formula:

$$\begin{aligned} w_i^{PSAsub1} = f_c d_i^v = \frac{\sum _{k \in s_r^c} d_k^r / \sum _{k \in s_r} d_k^r}{\sum _{j \in s_v^c} d_j^v / \sum _{j \in s_v} d_j^v} d_i^v \end{aligned}$$
(4)

where \(d^r, d^v\) represent the design weights for the reference and volunteer samples respectively and \(s_r^c, s_v^c\) are the individuals belonging to the c-th stratum of propensities in the reference and volunteer samples respectively. Valliant and Dever (2011) proposed a similar method, but instead of calculating a correction factor, the propensities in each stratum were averaged and then transformed into weights by inverse probability weighting, as follows (where \(\overline{{\hat{\pi }}_g^*}\) denotes the average of the estimated propensities in the g-th stratum):

$$\begin{aligned} w_i^{PSAsub2} = \frac{1}{\overline{{\hat{\pi }}_g^*}} \end{aligned}$$
(5)
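The four weighting schemes above translate directly into a few lines of R. This is an illustration rather than the authors' code: pi_hat_v, pi_hat_r, d_v and d_r are assumed vectors of estimated propensities and design weights for the volunteer and reference samples, and the strata are formed here from propensity quintiles, one reasonable way of obtaining groups of similar propensity.

```r
# Inverse probability weights (Valliant 2020) and the Schonlau-Couper variant
w_ipw1 <- 1 / pi_hat_v
w_ipw2 <- (1 - pi_hat_v) / pi_hat_v

# Subclassification into c = 5 propensity strata (here, propensity quintiles)
c_strata <- 5
breaks <- quantile(c(pi_hat_v, pi_hat_r), probs = seq(0, 1, length.out = c_strata + 1))
g_v <- cut(pi_hat_v, breaks, include.lowest = TRUE, labels = FALSE)
g_r <- cut(pi_hat_r, breaks, include.lowest = TRUE, labels = FALSE)

# Eq. (4): correction factor per stratum (Lee 2006; Lee and Valliant 2009)
f_c <- sapply(seq_len(c_strata), function(g)
  (sum(d_r[g_r == g]) / sum(d_r)) / (sum(d_v[g_v == g]) / sum(d_v)))
w_sub1 <- f_c[g_v] * d_v

# Eq. (5): inverse of the mean propensity per stratum (Valliant and Dever 2011),
# here averaging the estimated propensities over the combined sample
mean_pi <- tapply(c(pi_hat_v, pi_hat_r),
                  factor(c(g_v, g_r), levels = seq_len(c_strata)), mean)
w_sub2 <- 1 / mean_pi[g_v]
```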

3 Data

3.1 Artificial data

An experiment with artificial data was performed to evaluate the benefits of feature selection under different conditions. In this experiment, a population U of size \(N = 500,000\) was generated with 17 variables: eight variables \({\mathbf {x}} = (x_1, ..., x_8)\) were used as covariates for PSA algorithms, out of which variables \(x_1\), \(x_3\), \(x_5\) and \(x_7\) were used as calibration variables. Another eight variables \({\mathbf {y}} = (y_1, ..., y_8)\) were considered as target variables and a variable \(\pi \) measured the probability of each individual of the population being selected in the nonprobability sample.

The covariates were generated as described in Eq. 6. Four variables (\(x_1\), \(x_3\), \(x_5\), \(x_7\)) followed a Bernoulli distribution with \(p = 0.5\) and the other four (\(x_2\), \(x_4\), \(x_6\), \(x_8\)) followed Normal distributions with a standard deviation of one and a mean parameter dependent on the value of the previous Bernoulli variable for each individual; for instance, if the i-th individual had a value of 1 in \(x_1\), then its value for \(x_2\) was simulated according to a N(2, 1) distribution, and if it had a value of 0, then it was simulated according to a N(0, 1) distribution. This procedure induced a collinearity in the models if all of the covariates were used, an issue that could be addressed by variable selection algorithms.

$$\begin{aligned} \begin{array}{cc} x_{1i}, x_{3i}, x_{5i}, x_{7i} \sim Be(0.5) &{} i \in U \\ &{} \\ x_{ji} \sim N(\mu _{ji}, 1) &{} i \in U, j = 2, 4, 6, 8 \\ &{} \\ \mu _{ji} = \left\{ \begin{array}{cc} 2, &{} \text {if } x_{(j-1)i} = 1 \\ 0, &{} \text {if } x_{(j-1)i} = 0 \end{array} \right.&i \in U, j = 2, 4, 6, 8 \end{array} \end{aligned}$$
(6)

The inclusion probability \(\pi \) was made dependent on \(x_5, x_6, x_7\) and \(x_8\) as described in Eq. 7, which allowed the experiment to cover Missing At Random (MAR) situations.

$$\begin{aligned} ln \left( \frac{\pi _i}{1 - \pi _i} \right) = -0.5 + 2.5(x_{5i} = 1) + \sqrt{2\pi }x_{6i}x_{8i} - 2.5(x_{7i} = 1), \ \ \ i \in U \end{aligned}$$
(7)
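For concreteness, the population defined by Eqs. (6) and (7) can be generated with a few lines of base R. This is a sketch of the simulation design described above, not the authors' original script.

```r
# Artificial population of Sect. 3.1 (Eqs. 6 and 7)
set.seed(1)
N <- 500000
x1 <- rbinom(N, 1, 0.5); x3 <- rbinom(N, 1, 0.5)
x5 <- rbinom(N, 1, 0.5); x7 <- rbinom(N, 1, 0.5)
x2 <- rnorm(N, mean = 2 * x1, sd = 1)   # N(2,1) if x1 = 1, N(0,1) otherwise
x4 <- rnorm(N, mean = 2 * x3, sd = 1)
x6 <- rnorm(N, mean = 2 * x5, sd = 1)
x8 <- rnorm(N, mean = 2 * x7, sd = 1)

# Inclusion probability of the nonprobability sample (Eq. 7)
eta <- -0.5 + 2.5 * (x5 == 1) + sqrt(2 * pi) * x6 * x8 - 2.5 * (x7 == 1)
prob_sel <- plogis(eta)                 # pi_i = exp(eta) / (1 + exp(eta))
```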

The target variables were simulated as described in Eqs. 8 to 15. Four types of relationship were considered: no relationship at all with any other variable (\(y_1\) and \(y_2\)), a relationship with the selection mechanism (\(y_3\) and \(y_4\)), a relationship with some covariates related to the selection mechanism (\(y_5\) and \(y_6\)) and a relationship both with the selection mechanism and with some covariates (\(y_7\) and \(y_8\)).

$$\begin{aligned}&y_1 \sim Be(0.5) \end{aligned}$$
(8)
$$\begin{aligned}&y_2 \sim N(10, 1) \end{aligned}$$
(9)
$$\begin{aligned}&y_{3i} \sim Be\left( \frac{exp(\pi _i)}{1 + exp(\pi _i)} \right) , \ \ \ i \in U \end{aligned}$$
(10)
$$\begin{aligned}&y_{4i} \sim N(10, 1) + 5\pi _i, \ \ \ i \in U \end{aligned}$$
(11)
$$\begin{aligned}&y_{5i} \sim Be\left( \frac{exp(0.5 + 0.25(x_{5i} = 1) - 0.25(x_{5i} = 0) + x_{6i})}{1 + exp(0.5 + 0.25(x_{5i} = 1) - 0.25(x_{5i} = 0) + x_{6i})} \right) , \ \ \ i \in U \end{aligned}$$
(12)
$$\begin{aligned}&y_{6i} \sim N(10, 1) + 2(x_{5i} = 1) - 2(x_{5i} = 0) + x_{6i}, \ \ \ i \in U \end{aligned}$$
(13)
$$\begin{aligned}&y_{7i} \sim Be\left( \frac{exp(0.5 + 0.25(x_{7i} = 1) - 0.25(x_{7i} = 0) + x_{8i} + \pi _i)}{1 + exp(0.5 + 0.25(x_{7i} = 1) - 0.25(x_{7i} = 0) + x_{8i} + \pi _i)} \right) , \ \ \ i \in U \end{aligned}$$
(14)
$$\begin{aligned}&y_{8i} \sim N(10, 1) + 2(x_{7i} = 1) - 2(x_{7i} = 0) + x_{8i} + 5\pi _i, \ \ \ i \in U \end{aligned}$$
(15)

This procedure allowed the target variables to reflect all of the missing data mechanisms; \(y_1\) and \(y_2\) are examples of Missing Completely At Random (MCAR) data, where the outcome is not related to the selection. \(y_5\) and \(y_6\) are examples of Missing At Random (MAR) data, where the outcome is indirectly related to the selection through some variables. Finally, \(y_3, y_4, y_7\) and \(y_8\) are examples of Missing Not At Random (MNAR) data, where the outcome is directly related to the selection mechanism.

3.2 Real data

The experiment was then repeated using a real dataset as a pseudopopulation to examine whether variable selection algorithms might be helpful when more complex relationships are present in the data. The dataset was obtained from the January 2019 Barometer Survey (study number 3238) conducted by the Spanish Centre for Sociological Research (CIS, Spanish initials), a monthly survey that measures political and social opinions among the Spanish adult population (Spanish Center for Sociological Research 2019). The original dataset of the survey sample made available by the CIS included \(n = 2989\) individuals and \(p = 203\) variables, out of which 17 variables were finally selected:

  • 6 target variables: assessment of the current economic situation in Spain and in their own lives (binary, 1 if "bad" or "very bad", 0 otherwise), score on the ideological self-positioning scale (numeric, 1–10), assessment of the central government’s performance (binary, 1 if "Poor" or "Very poor", 0 otherwise), territorial organisation preference (binary, 1 if "State with no autonomous structures", 0 otherwise) and national sentiment (binary, 1 if "Self identification as only Spanish", 0 otherwise).

  • 10 variables to be used as covariates in PSA or calibration variables: frequency of attendance at religious acts, gender, age, education level, socioeconomic status, autonomous community of residence, size of the municipality of residence, nationality, marital status and degree to which voting is expected to change things. Gender, age and size of the municipality were chosen as calibration variables in each simulation run, and were also included as potential covariates for PSA.

  • One variable, use of internet in the three months prior to the survey (1 if it was used, 0 otherwise), was taken as a delimiter of the population subset from which nonprobability samples would be drawn. Individuals with a value of 1, but not those with a value of 0, in this variable could belong to the nonprobability sample. The rationale for this delimiter is that it reproduces the conditions that apply in real online surveys, in which people with no internet access cannot be selected to participate.

The pseudopopulation was obtained by bootstrapping the original sample up to \(N = 500,000\) individuals through simple random sampling with replacement. Out of the 500,000 individuals, 404,174 (\(80.83\%\) of the pseudopopulation) had used the internet in the three months prior to the survey. Despite the internet's high penetration, the differences between the population with and without access to the internet are noticeable in several target variables, which leads to a considerable amount of coverage bias when estimating population parameters using only people who had accessed the internet. This coverage bias can be treated using calibration in addition to PSA. These differences can be observed in Table 1.

Table 1 Population values of the variables of interest in the real data simulation. All numbers are population proportions of the features of interest described in the "Target variable" column, except for "Ideological self-positioning scale (1–10)" where the numbers correspond to the population mean

Prior to the bootstrapping, anyone who did not answer ("Does not know"/"Does not answer") any of the 17 items was excluded, as were the respondents who answered "Other" for education level or who gave "Ceuta" or "Melilla" as their autonomous community of residence. The purpose of this filtering was to remove highly uncommon classes that could produce inconsistencies in a simulated sample and provoke errors in the propensity scoring algorithms. Moreover, the education levels "No formal education" and "Primary education" were collapsed into a single class, while missing data in the variable concerning attendance at religious acts were treated as a new class (given that everyone in this group was considered to be atheist or agnostic). After this preprocessing, the sample size before bootstrapping was \(n = 2,156\).
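The bootstrapped pseudopopulation itself reduces to a single resampling step. In this sketch, cis_filtered is a hypothetical name standing for the preprocessed CIS sample of 2,156 respondents.

```r
# Pseudopopulation of N = 500,000 via simple random sampling with replacement
set.seed(1)
pseudo_pop <- cis_filtered[sample(nrow(cis_filtered), 500000, replace = TRUE), ]
```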

4 Methods

4.1 Feature selection algorithms

Feature selection was performed prior to PSA to select the variables that are most relevant for the prediction of a target variable (which must be present in the dataset). In a model-based framework, there is only one variable of interest, y, for which we both predict values and estimate population parameters. However, in the design-based framework of PSA, two variables must be considered: the indicator variable z (\(z_i = 1\) if \(i \in s_v\), \(z_i = 0\) otherwise), for which we predict the probability \(P(z = 1)\), and the target variable of the study, y, whose population parameters of interest we want to estimate.

Given that the predictive model is applied on z in PSA, it would be fair to assume that the relevant variables for prediction should be selected considering z as the target variable in the feature selection algorithms. However, given the previous research on PSA for experimental designs and the fact that the bias must be removed for y (and not necessarily for z), it could also be reasonable to select those variables which are more relevant to y, and therefore to consider y as the target variable in the feature selection algorithms. In this work, we have considered both scenarios: feature selection for the prediction of z, and feature selection for the prediction of y. In the former case, the combination of both samples \(s_v \cup s_r\) can be used, while in the latter case only \(s_v\) can be used for feature selection, as y has only been measured for individuals in \(s_v\).

The following feature selection algorithms were used in the experiment, and their performance was compared to the use of all variables and to the use of the variables provided by stepwise selection (an illustrative code sketch of these selectors is given after the list):

  • CFS (Correlation-based Feature Selection) filter with best first search. This algorithm, proposed by Hall (1999), searches for the subset of variables which maximises the correlation with the target variable and minimises the correlation between the variables of the subset. Thus, irrelevant and redundant features are discarded from the optimal subset of features for prediction. Note that Pearson’s correlation is used to evaluate the relationships between the variables; if any variable within a pair is non-numeric, it is binarised and each of the binary variables is then used separately. The simplicity of the algorithm makes it a fast and intuitive choice for finding the optimal subset of relevant covariates, while at the same time addressing the multicollinearity problems that may arise from focusing only on the correlation between the target variable and the covariates. On the other hand, the use of Pearson’s correlation coefficient implies that nonlinear relationships or categorical variables with high cardinality might be erroneously discarded by the algorithm.

  • Chi-square filter. This approach calculates Cramer’s V between the target variable and each independent variable, and the user must then define a cut-off point for selection. In our experiment, the cut-off point was the Cramer’s V value with the biggest difference from the V of the next variable in importance (ordered from highest to lowest). This filter performs better at finding relationships between categorical variables than between continuous ones: Cramer’s V depends on the number of classes of each pair of variables, so the coefficient can be considerably more sensitive when applied to continuous variables.

  • Gain ratio. This entropy-based filter (Quinlan 1986) is calculated by dividing the information gain by the entropy of the target variable. The information gain is measured as the difference between the sum of the entropies of the independent and the target variables and the entropy of the target variable after introducing the independent variable into the predictive model (defined as a decision tree). The gain ratio is thus a relative, continuous measure of the predictive performance of a variable. The cut-off point was the gain ratio value with the biggest difference from the gain ratio of the next variable in importance (ordered from highest to lowest). Like the rest of the entropy-based filters, the gain ratio is assumed to identify nonlinear relationships more precisely than correlation coefficients, but it requires the discretization of continuous variables in order to exploit its full potential (Yu and Liu 2003).

  • One-R. This algorithm, developed by Holte (1993), is based on very simple association rules, by which each independent variable is tabulated against the target variable. The classification error of each resulting one-variable rule is then determined and used to rank the variables, with lower error rates (higher accuracy) indicating stronger predictive power. OneR automatically divides continuous variables into categories using discretization functions, which makes it a suitable filter for datasets containing both categorical and continuous variables. However, the algorithm is prone to overfitting if the discretization aims to obtain "pure" classes where all individuals take the same values.

  • Random Forest importance filter. This algorithm computes the mean importance value across the trees created in a Random Forest model (Breiman 2001) for each independent variable. In our experiment, the importance value taken was the mean decrease in accuracy when the values of the variable were permuted in the Random Forest model. The cut-off point was the importance value with the biggest difference from the importance value of the next variable (ordered from highest to lowest according to their importance value). This algorithm is suitable for any kind of target variable and covariates, and its bagging configuration can be advantageous in more complex situations. In addition, the use of the mean decrease in accuracy instead of node impurity (measured with the Gini index) avoids overestimating the importance of continuous attributes. However, the method is still sensitive to multicollinearity problems (Nicodemus et al. 2010).

  • Boruta algorithm. This algorithm is based on the Random Forest importance measure, but it considers a set of non-informative variables created by randomly shuffling each independent variable included in the model. The algorithm then selects the variables that have greater importance than the non-informative ones. To obtain statistically valid results, the procedure is repeated until every variable has been deemed "important" or "unimportant". Further details on this algorithm can be found in Kursa and Rudnicki (2010). This algorithm can be considered an improved version of the Random Forest importance filter and has the advantage of automatically selecting the relevant variables and discarding the irrelevant ones. However, its computational cost is high.

  • LASSO regression (Tibshirani 1996). This regression model performs variable selection by introducing a penalisation term into the Ordinary Least Squares equations. The result is a regression model in which only the selected variables have non-zero coefficients. In the present study, we take advantage of the LASSO variable selection technique by extracting the variables with non-zero coefficients and using them as inputs for the propensity estimation models. When all the coefficients of the LASSO model are zero, no PSA is performed and the weights therefore remain unitary. The LASSO algorithm has provided better results than other predictive methods in nonprobability sampling contexts (Chen et al. 2019; Castro-Martín et al. 2020a), but its variable selection is performed with respect to a specific model and optimisation criterion. The advantages of using the subset of variables selected by LASSO as inputs for other predictive algorithms are unclear.
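The paper does not state which software implementation was used for these selectors; the following R sketch is one possible realisation, assuming the CRAN packages FSelector, Boruta and glmnet, a data frame dat whose column target holds a binary selection target (e.g. z) and whose remaining columns are candidate covariates, and a small helper reproducing the "largest gap" cut-off described above. Note that FSelector's chi.squared weight is used here as a stand-in for the Cramér's V ranking, and that for a continuous target the glmnet family would change accordingly.

```r
library(FSelector)   # cfs, chi.squared, gain.ratio, oneR, random.forest.importance
library(Boruta)
library(glmnet)

# Cut-off rule for the ranking-based filters: keep the variables ranked above
# the largest drop in importance
largest_gap <- function(w) {
  ord  <- order(w$attr_importance, decreasing = TRUE)
  gaps <- -diff(w$attr_importance[ord])
  rownames(w)[ord][seq_len(which.max(gaps))]
}

sel_cfs  <- cfs(target ~ ., dat)                              # CFS with best-first search
sel_chi  <- largest_gap(chi.squared(target ~ ., dat))         # chi-square filter
sel_gain <- largest_gap(gain.ratio(target ~ ., dat))          # gain ratio
sel_oner <- largest_gap(oneR(target ~ ., dat))                # One-R
sel_rf   <- largest_gap(random.forest.importance(target ~ ., dat,
                                                 importance.type = 1))  # mean decrease in accuracy
sel_bor  <- getSelectedAttributes(Boruta(target ~ ., data = dat))       # Boruta

# LASSO: variables with non-zero coefficients in a cross-validated binomial fit
X  <- model.matrix(target ~ ., dat)[, -1]
cf <- as.matrix(coef(cv.glmnet(X, dat$target, family = "binomial", alpha = 1),
                     s = "lambda.min"))
sel_lasso <- setdiff(rownames(cf)[cf[, 1] != 0], "(Intercept)")
```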

4.2 Estimation with Propensity Score Adjustment and calibration

Once the optimal subset of variables had been selected, the Propensity Score Adjustment (PSA) was performed. In addition to logistic regression (LR), the standard algorithm in PSA, several other algorithms were tested for propensity estimation, namely k-Nearest Neighbours (kNN), Gradient Boosting Machine (GBM) and feed-forward neural networks (NN). Parameter tuning was performed for these three algorithms, applying ten-fold cross-validation to the model predicting z prior to PSA; the following parameter grids were used for each algorithm (an illustrative code sketch of the tuning step is given after the list):

  • k-Nearest Neighbours (kNN): \(k = 5, 7, 9\).

  • Gradient Boosting Machine (GBM): number of trees \(= 50, 100, 150\), learning rate \(= 0.1\), interaction depth \(= 1, 2, 3\).

  • Feed-forward neural networks (NN): number of units in the hidden layer \(= 1, 3, 5\), weight decay \(= 0.1, 0.0001, 0\).
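The tuning step could be implemented, for example, with the caret package; this is an assumption for illustration, since the paper does not name the software used. In this sketch, s is the combined sample with a factor indicator z and the selected covariates; note that caret's GBM interface additionally requires n.minobsinnode, which is not part of the grid reported above and is fixed here at 10.

```r
library(caret)
ctrl <- trainControl(method = "cv", number = 10)   # ten-fold cross-validation

fit_knn <- train(z ~ ., data = s, method = "knn", trControl = ctrl,
                 tuneGrid = data.frame(k = c(5, 7, 9)))
fit_gbm <- train(z ~ ., data = s, method = "gbm", trControl = ctrl, verbose = FALSE,
                 tuneGrid = expand.grid(n.trees = c(50, 100, 150),
                                        interaction.depth = 1:3,
                                        shrinkage = 0.1,
                                        n.minobsinnode = 10))
fit_nn  <- train(z ~ ., data = s, method = "nnet", trControl = ctrl, trace = FALSE,
                 tuneGrid = expand.grid(size = c(1, 3, 5),
                                        decay = c(0.1, 0.0001, 0)))
```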

The choice of the kNN and GBM algorithms is based on their performance in the previous study by Ferri-García and Rueda (2020), where they outperformed logistic regression in some situations in terms of bias and MSE. The use of neural networks is intended to provide more diversity in the approaches, along with the possible predictive advantage that neural networks may offer in modelling (Breidt and Opsomer 2017). k-Nearest Neighbours is a simple algorithm that can provide good results when the number of covariates is low, while its performance decreases in contexts of high dimensionality. For this reason, variable selection might be highly recommendable to boost kNN performance. Gradient Boosting Machines cope better with such high-dimensional situations because their boosting algorithm is able to select the best predictors internally.

4.3 Experiment settings

In both scenarios, the same procedure was followed to measure the effects of variable selection in PSA and calibration on the estimation from nonprobability samples. This procedure, repeated across 400 simulation runs for each dataset (artificial and real), can be sequentially described as follows:

  1. Two samples of size n = 1,000 are drawn. The first one, \(s_r\), is the probability sample and is drawn by simple random sampling without replacement (SRSWOR) from the full population. The second sample, \(s_v\), is the nonprobability sample and is drawn according to the following schemes:

    • Artificial dataset: unequal probability sampling where \(\pi \) is the vector of inclusion probabilities, calculated as described in Equation 7.

    • Real dataset (two schemes):

      • SRSWOR from the subset of the population who had accessed the internet during the three months prior to the survey.

      • Unequal probability sampling from the subset of the population who had accessed the internet during the three months prior to the survey, with inclusion probabilities proportional to the age:

        $$\begin{aligned} \pi _i = \frac{(200 - \text {Age}_i)^5}{(200 - 10)^5}, i \in U_I \end{aligned}$$
        (16)

        where \(U_I\) is the subset of the pseudopopulation who used the Internet in the three months prior to the survey.

  2. The propensity of belonging to \(s_v\) is estimated with PSA, using the variable selection algorithms described in Sect. 4.1 to select the input covariates for the propensity prediction models, and the four algorithms described in Sect. 4.2 to model the propensities. We also consider the case where no variable selection algorithm is applied and all covariates are included in the models.

  3. Estimated propensities are transformed into weights using the inverse probability weighting formula \(w_i = 1/\pi _i\).

  4. Weights are used to estimate the population mean of each target variable, with and without applying Raking calibration; in the Raking case, the propensity weights w obtained in step 3 are used as the initial weights. A code sketch of one simulation run is given after this list.
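The following R sketch illustrates one run of this procedure on the artificial population. It is a simplified illustration: the covariate subset covs stands for the output of a selector, the unequal-probability draw is a simple without-replacement approximation, the estimator is a Hájek-type weighted mean, and the Raking step is omitted.

```r
# One simulation run (steps 1-4) on the artificial population
pop <- data.frame(x5, x6, x7, x8, y5, prob_sel)  # columns simulated in the Sect. 3.1 sketches
n  <- 1000
sr <- pop[sample(nrow(pop), n), ]                        # step 1: SRSWOR reference sample
sv <- pop[sample(nrow(pop), n, prob = pop$prob_sel), ]   # step 1: (approximate) unequal-probability sample

covs <- c("x5", "x6")                                    # step 2: covariates chosen by a selector (hypothetical)
s <- rbind(data.frame(sv[, covs, drop = FALSE], z = 1),
           data.frame(sr[, covs, drop = FALSE], z = 0))
fit <- glm(z ~ ., data = s, family = binomial)
pi_hat <- predict(fit, newdata = sv[, covs, drop = FALSE], type = "response")

w   <- 1 / pi_hat                                        # step 3: inverse probability weights
est <- sum(w * sv$y5) / sum(w)                           # step 4: weighted (Hajek-type) estimate of the mean of y5
```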

The resulting 400 estimates of the population mean for each combination of methods are subsequently used to obtain the relative bias of a given combination of methods:

$$\begin{aligned} RB (\%) = \left| \frac{\sum _{i = 1}^{400} \frac{\hat{{\overline{y}}}_i}{400} - {\overline{Y}}}{{\overline{Y}}}\right| \cdot 100 \end{aligned}$$
(17)

where \({\overline{Y}}\) is the population mean of the target variable, and \(\hat{{\overline{y}}}_i\) is the estimate of the population mean in the i-th simulation run, obtained after applying the bias reduction methods. Together with the relative bias, we also report the efficiency of each variable selection method with respect to the case in which all variables are used, given a propensity model m (Log. reg., GBM, kNN or NN), a Raking calibration choice r (yes or no), and a choice v for the target variable (exposure or outcome) used in the selection algorithms:

$$\begin{aligned} \text {Effect}_{k | m,r,v} = \frac{MSE_{k, m, r, v}}{MSE_{\text {All vars.}, m, r, v}} \end{aligned}$$
(18)

where \(k = \{\)Boruta, CFS, Chi-squared, Gain ratio, LASSO, StepWise, OneR, Random Forest importance\(\}\) is the variable selection algorithm and MSE is the Mean Squared Error observed for the combination of methods:

$$\begin{aligned} MSE = \text {Bias}^2 + \text {Variance} = \left( \frac{\sum _{i = 1}^{400} \hat{{\overline{y}}}_i}{400} - {\overline{Y}}\right) ^2 + \frac{\sum _{i = 1}^{400} \left( \hat{{\overline{y}}}_i - \frac{\sum _{i = 1}^{400} \hat{{\overline{y}}}_i}{400} \right) ^2}{399} \end{aligned}$$
(19)

An effect greater than 1 means that the variable selection method k is less efficient than using all covariates, while an effect below 1 means that the selector k provides more efficient estimates, provided all other adjustments remain equal. This "Effect" can thus be read as the MSE of a given variable selection method relative to the reference case in which all variables are used.
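Given the 400 simulated estimates for a selector and for the all-variables baseline (hypothetical vectors est_k and est_all) and the true population mean Ybar, Eqs. (17) to (19) translate directly into a few lines of R:

```r
rel_bias <- function(est, Ybar) abs(mean(est) - Ybar) / abs(Ybar) * 100  # Eq. (17)
mse      <- function(est, Ybar) (mean(est) - Ybar)^2 + var(est)          # Eq. (19); var() divides by n - 1 = 399
effect_k <- mse(est_k, Ybar) / mse(est_all, Ybar)                        # Eq. (18)
```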

The statistical significance of each effect was tested using bootstrapping techniques. Basic resampling (with 1,000 replications) was performed for each effect, so that the standard deviation of the effect could be estimated from the bootstrap samples. This standard deviation was then used to perform t-tests of the following hypotheses:

$$\begin{aligned} \begin{array}{rl} H0: \text {{Effect}}_{k | m,r,v} &{} \ge 1 \\ H1: \text {{Effect}}_{k | m,r,v} &{} < 1 \\ \end{array} \end{aligned}$$
(20)

The t-tests used the standard deviation calculated from the bootstrap procedure, and the confidence level was fixed at 95\(\%\) for all effects. Rejection of the null hypothesis would mean that there is statistical evidence that the variable selection method k provides more efficient estimates, provided that the rest of the conditions remain unchanged. Resampling was performed using the resample package available in R (Hesterberg 2015).
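One plausible reading of this test, written here in base R rather than with the resample package, resamples the 400 paired simulation estimates to obtain the standard deviation of the effect and then carries out the one-sided t-test of Eq. (20); the pairing of the resamples and the degrees of freedom are assumptions made for this sketch.

```r
mse <- function(est, Ybar) (mean(est) - Ybar)^2 + var(est)
eff_hat <- mse(est_k, Ybar) / mse(est_all, Ybar)

set.seed(1)
B <- 1000
eff_boot <- replicate(B, {
  idx <- sample(length(est_k), replace = TRUE)          # resample simulation runs
  mse(est_k[idx], Ybar) / mse(est_all[idx], Ybar)
})

t_stat  <- (eff_hat - 1) / sd(eff_boot)                 # H0: Effect >= 1 vs H1: Effect < 1
p_value <- pt(t_stat, df = B - 1)                       # a small p-value favours Effect < 1
```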

Finally, for each feature selection algorithm and Raking choice (Raking used or not used after PSA), we computed the estimated mean and median relative bias and effect. For relative bias, we also computed the number of times that the estimates provided by a feature selection algorithm have been among the best (Relative Bias less than 1% greater than the minimum), conditional on a given variable of interest, target variable in the feature selection algorithm, PSA predictive algorithm and Raking strategy. For the effect, we computed the number (and percentage) of times that the effect has been below 1 (the MSE after applying a given feature selection algorithm was lower than the MSE using all variables) and below 0.9 (the MSE after applying a given feature selection algorithm was more than 10\(\%\) lower than the MSE using all variables).

5 Results

5.1 Artificial data

The relative bias results obtained in the simulation with artificial data are shown in Tables 8 and 9. For the MCAR variable \(y_1\), variable selection was useful when neural nets were used as the predictive model and Raking was applied after PSA, although the improvements were not dramatic. The least biased estimates were provided by PSA with kNN using all variables in \(y_1\) and variables selected by OneR (selecting on the variable of interest) with neural nets and no Raking in \(y_2\), although this result was closely followed by the Gain ratio score in the latter case. However, the differences are too small to be considered relevant.

With the MAR variables (\(y_5\) and \(y_6\)), Raking calibration markedly reduced the bias in the estimates. Regarding variable selection, almost all the methods in \(y_5\) and some of them in \(y_6\) reduced the bias when the predictive model was logistic regression, although some reductions were also observed when other methods were applied in different models. In the case of \(y_6\), the chi-square filter, the Gain Ratio and Random Forest all reduced the bias from 2.88 (when using all available covariates) to 2.01 if logistic regression and Raking calibration were applied.

Finally, in MNAR situations (\(y_3\), \(y_4\), \(y_7\) and \(y_8\)), the application of Raking calibration also reduced bias, but not as much as for the MAR variables. For \(y_3\) and \(y_8\), the best choice for the target variable in the selection algorithms was the variable of interest (y), while fixing the indicator variable of inclusion in \(s_v\) (z) as the target provided better results in \(y_4\). The largest reductions in bias in \(y_3\) were obtained with the LASSO algorithm, although CFS, Chi-square and the Gain Ratio also worked well when combined with Raking.

The effect of each variable selection method in comparison to using all variables, when the rest of the methods remain equal, is detailed in Tables 10 and 11. These results are in line with those for relative bias in each case, although they reflect some improvement in a much larger set of situations. With the MCAR variables (\(y_1\) and \(y_2\)), the MSE could be slightly reduced using variable selection in some cases, but in general the improvements in effect were small. Reductions were always below 10\(\%\) of the MSE using all variables, and only 7.03\(\%\) of the efficiencies were below 0.95 (that is, reductions of more than 5\(\%\) of the baseline MSE), all of them achieved when selecting variables with regard to the variable of interest.

Regarding the MAR variables (\(y_5\) and \(y_6\)), very noticeable improvements in efficiency were obtained in the estimation of \(y_6\). When Raking calibration was applied, the use of the Chi-square filter, Gain Ratio or Random Forest reduced the MSE by between \(11\%\) (if k-NN was used in PSA) and \(50\%\) (if LR was used in PSA) when these methods selected variables using the variable of interest as the target. Reductions in MSE with the same variable selection methods were also observed when Raking was not applied. Other methods also achieved efficiency gains in the estimation of \(y_5\) in the cases in which logistic regression was used to estimate the propensities. It is worth noting that the hypothesis of greater efficiency of the selection methods was supported (i.e. the null hypothesis was rejected) in several cases, which gives some evidence that variable selection methods can work in practice.

Finally, regarding the MNAR variables, the reductions in MSE were noticeable: the effect was below 0.9 (improvements in the MSE above \(10\%\)) in \(12.5\%, 11.7\%\) and \(24.2\%\) of the cases in the estimation of \(y_3\), \(y_7\) and \(y_8\) respectively. In the estimation of \(y_3\), all the combinations of methods, except for those involving the Boruta selection algorithm, provided efficiencies significantly lower than 1 (with a mean effect of 0.922) when selecting variables using the variable of interest as the target in the feature selection algorithms. The opposite situation was observed in \(y_4\): selecting variables using the indicator variable of inclusion in \(s_v\) as the target provided better results, although the improvements were considerably smaller, despite many of them being statistically significant. In \(y_7\), feature selection algorithms provided more efficient estimates mainly in those cases where logistic regression was used, although OneR (selecting on the variable of interest) provided gains when using other algorithms for PSA in the cases where Raking was not applied. For \(y_8\), there is statistical evidence that all of the feature selection algorithms (except StepWise and LASSO) provide more efficient estimates when the selection is done using the variable of interest as the target, regardless of the classifier used in PSA or the further use of Raking calibration.

A summary of the relative bias and effect results observed in the artificial data simulation can be found in Table 2. When Raking calibration is not applied, the results on relative bias are very similar across variable selection strategies, although the approach that appears most often among the best is the strategy of using all variables. When Raking calibration is applied, the situation changes slightly, with the gain ratio score appearing most often in the set of algorithms that provide the best results, followed by the OneR algorithm. These results can be explained by the fact that each Bernoulli covariate is related to a Gaussian covariate, but the different Bernoulli and Gaussian covariates are independent of each other. This scenario can be more suitable for simple algorithms that test one variable at a time, such as OneR, because it is only necessary to retain one of the two variables that compose each Bernoulli-Gaussian pair, while in more complex scenarios the choice would not be so clear. Regarding the effect, none of the variable selection algorithms seems to outperform the case where all variables are used when Raking calibration has not been applied. If it has been applied, there is some evidence of the effectiveness of variable selection, with all of the algorithms except StepWise and Boruta achieving lower MSE values than the all-variables case in more than half of the cases in which they are included in the preprocessing step.

Table 2 Estimated mean and median of RB and Effect of estimates using PSA for each algorithm, and number of times its estimates have been among the best (RB less than 1% greater than the minimum) or have been more efficient than using all variables (Effect under 1 and under 0.9) in the artificial data simulation

5.2 Real data

The relative bias results obtained by each combination of methods in the simulation using CIS data, with SRSWOR from the Internet population used to obtain the nonprobability samples, are listed in Table 12. Interestingly, the best choice in variable selection differed according to the propensity estimation model considered. For example, PSA using k-NN provided the best results when using all the available covariates, except for the variable measuring central government performance. In the remaining cases, the use of certain variable selection algorithms was associated with a decrease in relative bias. This was especially apparent for the variables measuring the economic situation in Spain, central government management and the preference for a unitary national state without autonomous communities. In these cases, the largest reductions in relative bias (compared to the case in which all variables were used) were obtained when the variable selection algorithms used the variable of interest as the target variable. Raking calibration had a modest positive effect on the variables measuring the ideological self-positioning scale, the preference for a unitary national state without autonomous communities and whether the respondent self identified as only Spanish, while its impact on relative bias in the other variables was negligible or negative.

The efficiency of each variable selection algorithm for a given combination of adjustments (propensity model, use of calibration and target variable choice for selection), in comparison with the case in which all variables are used, is shown in Table 13. For all variables, one or more selection algorithms increased the efficiency in comparison with the case in which all variables were used. The estimated mean and median effect of all the possible combinations of methods shown in Table 13 are below 1 for all the target variables except for the ideological self-positioning scale. However, the numbers depend strongly on the target fixed in the feature selection algorithm. For instance, the percentage of combinations with an effect below 0.9 (a decrease of more than 10\(\%\) in the MSE in comparison to the case where all covariates are used) when fixing the variable measuring the inclusion in \(s_v\) (z) as the target is 0\(\%\) when estimating the economic situation in Spain, central government management and the preference for a unitary national state without autonomous communities, but if the variable of interest is fixed as the target in the feature selection algorithms, the percentage rises to 12.5\(\%\), 29.7\(\%\) and 20.3\(\%\) respectively. On the other hand, feature selection provides better results in the estimation of ideological self-positioning scale scores when fixing the variable measuring the inclusion in \(s_v\) (z) as the target (estimated median effect: 0.942; percentage of combinations with an effect below 0.9: 20.3\(\%\)). Some statistical evidence can be observed regarding the effectiveness of several methods, such as the chi-square filter, LASSO and OneR for PSA with logistic regression when estimating the economic situation in Spain, or CFS and Gain ratio when estimating the ideological self-positioning scale mean score, among other examples.

A summary of the relative bias and effect results can be found in Table 3. When Raking calibration is not applied, results on relative bias are very similar across variable selection strategies, although the most featured method among the best ones is CFS, followed by using all variables. When Raking calibration is applied, the situation slightly changes, with OneR appearing the most in the set of algorithms that provide the best results, followed by CFS. Regarding effect, it can be observed that all the variable selection algorithms (except for StepWise and Boruta when applying Raking calibration) provide more efficient estimates in more than half of the cases, a percentage that goes above 60\(\%\) and even 80\(\%\) of the cases for CFS and OneR when Raking is applied. In both situations mentioned, along with the case of chi-square filter, the percentage of cases where the effect is below 0.9 (reduction of the MSE above 10\(\%\) in comparison to the case where all covariates are used) is almost 20\(\%\) (18.8\(\%\)). These results, along with the statistical evidence observed in hypothesis testing, suggest an advantage of variable selection methods in comparison to the use of all the available covariates.

Table 3 Estimated mean and median of Relative Bias and Effect of estimates using PSA for each algorithm, and number of times its estimates have been among the best (Relative Bias less than 1% greater than the minimum) or have been more efficient than using all variables (Effect under 1 and under 0.9) in the real (bootstrapped) data simulation with SRSWOR from the Internet population to obtain the nonprobability sample

The relative bias results obtained by each combination of methods in the simulation using CIS data, with inclusion probabilities proportional to age used to obtain the nonprobability samples from the Internet population, are listed in Table 14. It is worth noting that the behaviour of the relative bias changes for some variables: the non-raked estimates of the personal economic situation are less biased than in the case where the nonprobability sample is obtained via SRSWOR from the Internet population, but the estimates of the remaining variables of interest are more biased. On the other hand, feature selection algorithms perform very similarly to the previous case, providing less biased estimates in a variety of scenarios. The largest reductions in relative bias (compared to the case in which all variables were used) were again obtained when the variable selection algorithms used the variable of interest as the target variable, especially (but not exclusively) if Raking calibration was applied.

The effect of each variable selection algorithm for a given combination of adjustments (propensity model, use of calibration and target variable choice for selection), in comparison with the case in which all variables are used, is shown in Table 15. In general, the effect was noticeably lower in this scenario than in the previous one (SRSWOR from the Internet population to obtain the nonprobability samples). The percentage of combinations that provided effects below 0.9 was more than 10\(\%\) in the estimation of all the variables of interest, and the estimated median was below 1 except for the ideological self-positioning scale and the feeling only Spanish variables. In those cases, the effect differs markedly depending on the choice of the target variable in the feature selection algorithm; when estimating the ideological self-positioning scale, it is better to fix the indicator variable of inclusion in \(s_v\) as the target (estimated median effect: 0.933 against 1.37 when using the variable of interest), and vice versa for feeling only Spanish (estimated median effect: 0.990 against 1.02 when using the indicator variable of inclusion in \(s_v\)). In addition, the number of statistically significant results for the effect is large, to the point that for each variable of interest and Raking choice (except the personal economic situation with no Raking, and feeling only Spanish with Raking) there is a feature selection algorithm that has a positive effect on the estimators’ efficiency.

The aforementioned results are summarized in Table 4. When Raking calibration is not applied, the estimated mean and median relative bias is smaller for certain feature selection methods, namely the chi-square filter, OneR and Random Forest importance. These algorithms are also the ones that appear most often among the best approaches for feature selection (in terms of relative bias), and their high performance persists when Raking calibration is applied. Regarding the effect, the chi-square filter, OneR, Gain ratio and Random Forest importance provide the best results, yielding more efficient estimates in around 70\(\%\) of the cases and sometimes in more than 80\(\%\), while other algorithms, such as CFS, also seem to offer good performance. It is particularly relevant that the effect of using the chi-square and OneR algorithms was below 0.9 in more than 40\(\%\) of the cases when Raking calibration was used.

Table 4 Estimated mean and median of Relative Bias and Effect of estimates using PSA for each algorithm, and number of times its estimates have been among the best (Relative Bias less than 1% greater than the minimum) or have been more efficient than using all variables (Effect under 1 and under 0.9) in the real (bootstrapped) data simulation considering inclusion probabilities proportional to age in the nonprobability sample

6 Application study

This section presents an application of variable selection for PSA in a real-world context, to estimate the population mean of two variables using a probability and a nonprobability sample. The application takes place within a study on abuse and dependence in a population of university students.

The probability sample used as the reference sample was obtained from a survey conducted in 2015, targeting students at the University of Granada (UGR), Spain. The sample was composed of \(n_{r} = 856\) respondents, recruited in face-to-face interviews under a three-stage cluster sampling design, which produced an estimated sampling error of \(\pm 3.3 \%\) in the case of \(p=q=0.5\) with a confidence level of 95\(\%\). The survey questionnaire included screening instruments for abuse and dependence, namely the Spanish Mobile Phone Abuse Questionnaire (ATeMo) (Olivencia-Carrión et al. 2018), which provides a score between 0 and 100 points reflecting the level of mobile phone abuse of the participant. The ATeMo instrument contains 25 five-point Likert-type items with possible values of 0, 1, 2, 3 and 4. The survey also included the Cannabis Abuse Screening Test (CAST) (Legleye et al. 2007) and the Severity of Dependence Scale (SDS) (Gossop et al. 1995), together with subscales regarding internet and videogame addiction from the MULTICAGE-CAD4 instrument (Pedrero-Pérez et al. 2007). The survey also recorded the age, gender and university faculty of each participant.

The nonprobability sample was derived from a survey completed by self-selected respondents, conducted in January 2018 and also targeting UGR students. The sample was composed of \(n_{v} = 176\) respondents, who were recruited via snowball sampling performed by the students themselves. All of the variables included in this survey were measured in the reference sample. However, some data preprocessing was performed prior to the analysis: four respondents were ruled out because they were under 18 years old, as were another 43 who left more than 85\(\%\) of the questionnaire items unanswered or who left blank all of the items of any of the scales. The final sample size, therefore, was \(n_{v} = 129\) individuals. Missing data in the sample were imputed using the Classification and Regression Trees (CART) algorithm (Breiman et al. 1984).

Age, gender and faculty were used as calibration variables in Raking, as the population totals (but not the cross-classification totals) were available. The covariates eligible for PSA were the total score for the CAST and SDS scales, the MULTICAGE subscales (internet and videogames), and the variables used for calibration: age, gender, and faculty. In total, seven variables were eligible for propensity modelling. The two variables of interest were present in both samples; this is not a feasible situation in real-world applications of PSA (the target variable would not be available in the probability sample), but in this case it allowed us to compare the estimates from both samples. These variables were:

  • Mean score on the total ATeMo scale, which was 30.066 units in the reference sample (with a sample standard deviation of 15 units) and 32.558 units in the unweighted convenience sample (with a sample standard deviation of 13.99 units).

  • Mean score on the item "I have tried to spend less time using my mobile phone but I cannot do it" (number 16 in the ATeMo instrument). The mean score of this item in the reference sample was 0.776 (with a sample standard deviation of 1.002), while in the unweighted convenience sample it was 1.217 (with a sample standard deviation of 1.132), this being the greatest difference observed in any ATeMo item between the reference sample and the convenience sample.

Table 5 shows the distributions of the covariates available for PSA in both samples. Except for gender, the distributions of the covariates differ greatly between the two samples. Overall, respondents to the online sample were younger and more prone to cannabis consumption. In addition, their scores on the MULTICAGE subscales of internet and videogames addiction tended to be higher than those of the reference sample members. Finally, the Science Faculty at the UGR was clearly overrepresented in the online sample, as was, to a lesser extent, the Medicine Faculty, while the other faculties were underrepresented. Given that the differences between the samples are reflected in the covariates, PSA seems likely to help obtain more efficient estimates from the online sample.

Table 5 Distributions of covariates in online and reference samples

Estimation of the population means followed the procedure described in Sect. 4.3: each variable selection algorithm was applied before PSA (with the same predictive models and hyperparameter optimisation as in the simulations: logistic regression, GBM, k-NN and neural networks) using the reference and convenience samples described above, and the resulting weights were used directly in the estimators or as initial weights for Raking calibration. The estimated population means for each combination of methods and the estimated Leave-One-Out jackknife variances (Quenouille 1956) are shown in Tables 6 and 7, respectively.
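As an illustration of this pipeline, the following is a minimal sketch assuming a logistic-regression propensity model, the simple inverse of the estimated propensity as the PSA weight (one of several possible weight formulations), a Hájek-type weighted mean, and a leave-one-out jackknife over the convenience sample; the covariate and column names are hypothetical, and the study also used GBM, k-NN and neural networks as propensity models.

```python
# Sketch of PSA weighting, estimation and jackknife variance (hypothetical column
# names; logistic regression stands in for the other propensity models used).
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

COVARIATES = ["age", "gender", "faculty", "cast", "sds", "multicage_int", "multicage_vg"]

def psa_weights(reference, volunteer, covariates):
    """Estimate the propensity of belonging to the volunteer sample and invert it."""
    combined = pd.concat([reference[covariates], volunteer[covariates]], ignore_index=True)
    z = np.r_[np.zeros(len(reference)), np.ones(len(volunteer))]   # 1 = volunteer sample
    X = pd.get_dummies(combined, drop_first=True)
    model = LogisticRegression(max_iter=1000).fit(X, z)
    propensity = model.predict_proba(X.iloc[len(reference):])[:, 1]
    return 1.0 / propensity                                        # one common PSA weight choice

def weighted_mean(y, w):
    """Hájek-type weighted mean."""
    return np.sum(w * y) / np.sum(w)

def jackknife_variance(reference, volunteer, covariates, target):
    """Leave-one-out jackknife variance over the volunteer sample (Quenouille 1956)."""
    n = len(volunteer)
    estimates = np.empty(n)
    for i in range(n):
        sub = volunteer.iloc[np.delete(np.arange(n), i)]
        w = psa_weights(reference, sub, covariates)
        estimates[i] = weighted_mean(sub[target].to_numpy(), w)
    return (n - 1) / n * np.sum((estimates - estimates.mean()) ** 2)
```

With both samples loaded as data frames, `weighted_mean(volunteer["atemo_total"].to_numpy(), psa_weights(reference, volunteer, COVARIATES))` would produce an estimate of the kind reported in Table 6, and the resulting weights could be passed to the raking step sketched above before estimation.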

In all cases, the use of variable selection algorithms brought the estimates closer to the value observed in the reference sample. For the estimation of Item number 16, variable selection with the indicator of inclusion in \(s_v\) (z) as the target variable gave the subsets providing the closest estimates for each predictive algorithm, while for the ATeMo score the best choice was to set the variable of interest (y) as the target of the variable selection algorithms. Raking calibration also helped provide estimates that were closer to the reference sample value, especially in the case of Item number 16. On the other hand, the application of these methods increased the variance of the estimator; in general, this increase was larger when a variable selection algorithm was used (with some exceptions).
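To make the role of the target variable concrete, the following hedged sketch ranks the candidate covariates by importance using either the inclusion indicator z or the study variable y as the target; the random-forest ranking and the top-k rule are illustrative assumptions rather than the specific selection algorithms compared in the paper.

```python
# Sketch: ranking candidate covariates with either the inclusion indicator z
# (volunteer vs. reference) or the study variable y as the target.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

def rank_covariates(reference, volunteer, candidates, target="z", k=4):
    if target == "z":
        # Which covariates best separate the volunteer sample from the reference sample?
        X = pd.get_dummies(
            pd.concat([reference[candidates], volunteer[candidates]], ignore_index=True),
            drop_first=True,
        )
        label = np.r_[np.zeros(len(reference)), np.ones(len(volunteer))]
        model = RandomForestClassifier(n_estimators=500, random_state=1).fit(X, label)
    else:
        # Which covariates best predict the study variable, observed in the volunteer sample?
        X = pd.get_dummies(volunteer[candidates], drop_first=True)
        model = RandomForestRegressor(n_estimators=500, random_state=1).fit(X, volunteer[target])
    importances = pd.Series(model.feature_importances_, index=X.columns)
    return importances.sort_values(ascending=False).head(k)
```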

Table 6 Mean estimates of the population mean for both target variables after applying each combination of adjustments
Table 7 Estimated variance of the estimators, obtained via Leave-One-Out jackknife, after applying each combination of methods

7 Discussion and conclusions

In propensity estimation models for online surveys, the question of which variables to include has been widely discussed, and in some cases survey questions have been included specifically to distinguish between individuals in the potentially covered population and those in the target population (Schonlau et al. 2007). Informative variables can be selected by the practitioner prior to the study, especially when there is some knowledge of the relationships between variables. However, there is often no prior information at all on the relationships among the variables, a circumstance that is even more likely in high-dimensional contexts, which are becoming ever more frequent with the development of Big Data methods in survey sampling.

In such cases, variable or feature selection algorithms may contribute to identifying the most informative subset of variables. The simulations performed in our study, using synthetic data and a real survey, reveal the impact of variable selection. In building the models, we also considered machine learning classification algorithms and the subsequent application of Raking calibration, in order to determine which alternatives are most effective in terms of bias removal.

Our analysis shows that feature selection makes a significant contribution to reducing relative bias. However, the best feature selection algorithm, both in this respect and in its effect on the estimation, varies according to the dataset considered and the adjustment choices made; there is no one-size-fits-all solution. Nevertheless, the reduction of model complexity associated with variable selection consistently produced more efficient estimators. As expected, selecting variables according to their impact on the outcome variable provided the best results overall. In line with Austin and Stuart (2015), we find that the propensity score can only balance the covariates included in the model, so it is preferable to include prognostically important variables (those related to the outcome), as this increases the likelihood of mitigating the bias in the estimation of the target variable. In view of these results, in practice the combination of several variable selection approaches, rather than just one, may be useful to identify the best subset in each situation.

Regarding other adjustment methods, Raking calibration after PSA proved to be the most efficient technique in almost all cases. However, the redundancy of variables between adjustments can reduce the efficiency of their combination in some cases, as observed by Lee and Valliant (2009), who reported that using the same variables for PSA and calibration resulted in estimates which, despite being less biased than estimates using PSA alone, were outperformed by adjustments with no redundancy.

On the other hand, the use of classification algorithms instead of logistic regression for estimating propensities was advantageous overall, although the benefit was restricted to certain algorithms and there was no clear indication of which algorithm is best for estimation. The application of this type of algorithm in nonprobability sampling was recently studied by Buelens et al. (2018) as an option for model-based estimation, and by Castro-Martín et al. (2020b), Ferri-García and Rueda (2020) and Ferri-García et al. (2020) for PSA in online surveys. It has also been studied for PSA in nonresponse adjustment (Phipps and Toth 2012; Buskirk and Kolenikov 2015), with promising results. Further studies should consider this approach, together with a wider range of algorithms, and should examine how preprocessing (such as the feature selection applied in the present study) might influence their performance in propensity estimation.

Further research is needed on the implications of variable selection for nonprobability samples, as our study has certain limitations. Most importantly, relatively few covariates were available in each simulation and in the application study. Feature selection algorithms were originally intended to reduce dimensionality in large datasets by retaining only the most relevant variables for prediction. Further research into these algorithms in PSA for selection bias treatment, using a larger number of covariates, would enhance our understanding of these questions. However, our results also support their use in low-dimensional contexts, meaning that the value of these algorithms could extend beyond computational efficiency. For example, the use of variable selection algorithms could be extended to calibration; although research has shown their potential and some methods have been developed in this area (Chen et al. 2019), further study of this topic is needed, as calibration requires little auxiliary information and can therefore be applied more widely. Finally, the use of more powerful algorithms for propensity estimation, such as deep learning techniques, should be considered in future studies, as these methods usually involve automatic variable selection and could provide more precise estimates.