Variable selection in Propensity Score Adjustment to mitigate selection bias in online surveys

The development of new survey data collection methods such as online surveys has been particularly advantageous for social studies in terms of reduced costs, immediacy and enhanced questionnaire possibilities. However, many such methods are strongly affected by selection bias, leading to unreliable estimates. Calibration and Propensity Score Adjustment (PSA) have been proposed as methods to remove selection bias in online nonprobability surveys. Calibration requires population totals to be known for the auxiliary variables used in the procedure, while PSA estimates the volunteering propensity of an individual using predictive modelling. The variables included in these models must be carefully selected in order to maximise the accuracy of the final estimates. This study presents an application, using synthetic and real data, of variable selection techniques developed for knowledge discovery in data to choose the best subset of variables for propensity estimation. We also compare the performance of PSA using different classification algorithms, after which calibration is applied. We also present an application of this methodology in a real-world situation, using it to obtain estimates of population parameters. The results obtained show that variable selection using appropriate methods can provide less biased and more efficient estimates than using all available covariates.


Introduction
In recent years, online surveys have undergone rapid development in a wide variety of fields, including public opinion research (Couper 2000) and life sciences (Thornton et al. 2016;Borodovsky et al. 2018). In contrast to traditional survey modes, which are experiencing issues with response rates (according to Marken (2018), response rates in Gallup Poll Social Series dropped from 28% in 1997 to 7% in 2017) and increasing costs, online surveys offer a faster and cheaper method to measure certain features in individuals. In addition, there is an increasing availability of large sets of data obtained from the Web with automatic procedures (such as web scraping or APIs) that are often used for inference in finite populations.
Non-probabilistic surveys provide us with some advantages over traditional methods but also these surveys have given us many research problems: many problems demand the type of velocity for both data processing and analysis, but the most important is related to the quality of the data. Data quality is far more important than data quantity. Meng (2018) studies some theoretical aspects related to the impact of the application of non-probability online surveys on the estimation quality, and develops a data defect index, a main topic of population inferences from Big Data and concludes it is more important to reduce sampling and non-response biases than non-response rates.
Indeed, non-probabilistic surveys emphasise certain types of nonsampling errors. It is not feasible to obtain a representative sampling frame of the online population except in specific situations where the target population is a well-characterised group (such as company employees or university students each of whom is associated with an e-mail address). For this reason, most online surveys or large volume datasets are based on volunteer samples. In addition, the coverage of this approach is limited by the extent of Internet penetration among the population, which is often subject to demographic characteristics. For instance, according to the Spanish Survey on Equipment and Use of Information and Communication Technologies in Households (National Institute of Statistics of Spain 2018), while 98.5% of the Spanish population aged 16-24 years make regular use of the Internet, only 49.1% of those aged 65-74 years do so. Although the difference has narrowed in the last few years, online surveys are still unable to provide representative samples except when special procedures are used, such as offline recruitment, panels or mixed modes (see Schonlau and Couper 2017 for a review of the available options).
The lack of a probability sampling scheme might lead to significant differences between sampled and nonsampled individuals, which constitutes a selection bias that cannot be redressed with the usual procedures (Elliott and Valliant 2017). Selection bias is a particularly important concern in online surveys because of their intrinsic characteristics (Couper 2000). Statistical adjustments are crucial to obtaining reliable estimates from online survey data; in this context, calibration or Propensity Score Adjustment (PSA) can be used, according to the kind of auxiliary information available. While calibration only needs the vector of population totals for some auxiliary covariates, PSA requires a probability sample drawn from the same target population, even when the nonprobability sample is drawn from a subset of it, which is the case of Internet surveys (not everybody may have access to the Internet in a given popu-lation) and imperfect sampling frames in general. This sample is used to estimate the (unknown) participation propensities for the individuals in the nonprobability sample through prediction models. These estimated propensities can be used as inclusion probabilities to build weights for different parametric estimators.
The efficacy of PSA at removing selection bias has been proved, although some considerations should be taken into account. First, PSA is strongly dependent on the covariates used to estimate the propensities. Lee (2006) showed that the use of covariates which are strongly related to the variables of interest in PSA models achieves greater reductions in bias than is the case with nonsignificant variables. Second, further adjustments such as calibration procedures must be applied in order to maximise the effectiveness of PSA (Lee and Valliant 2009;Valliant and Dever 2011;Valliant 2020). Finally, the use of PSA is associated with an increase in the variance of the estimates.
In this study, we focus on the first point raised above: the choice of covariates. Lee (2006) suggested that including all available covariates, as recommended by Rubin and Thomas (1996), might be a reasonable practice. However, statistical models based on modern classification techniques such as Machine Learning algorithms might benefit from feature selection to reduce the complexity of the models (and the variance of their predictions). Variable inclusion in propensity models for treatment weighting has been widely studied (Hirano and Imbens 2001;Brookhart et al. 2006;Austin 2008;Schneeweiss et al. 2009;Austin 2011;Myers et al. 2011;Patrick et al. 2011;Austin and Stuart 2015) and variables are often selected using a stepwise algorithm or they are assessed prior to the study according to their known relationship to the outcome or exposure variables. In this case, better results are obtained when the variables in question are related to the outcome variables or to both the outcome and the exposure variables.
In many real-world applications, there may be very little information about the preexisting relationships between variables, which increases the difficulty of selecting the best subset of variables for propensity estimation. In the present study, we consider how modern techniques of feature selection (or variable selection) developed for knowledge discovery in data can be used in propensity estimation modelling. These techniques only require an appropriate dataset from which to locate the variables more closely related to a given target variable or that may be more influential with respect to predicted values, according to the behaviour observed in the dataset. The benefits of feature selection, in terms of increased accuracy and reduced computational costs, have been demonstrated in classification tasks (Bolón-Canedo et al. 2013;Xue et al. 2015).
In survey research, feature selection has been studied with respect to the problem of calibration when a large number of variables must be considered. Breidt and Opsomer (2017) reviewed this question and suggested that auxiliary variables for calibration may be too closely correlated or have poor predictive power, and therefore model selection should be employed to improve the estimates obtained and to stabilise the weights. Stepwise and best subsets algorithms have been considered for this purpose, but models from the class of "least absolute shrinkage and selection operator" (LASSO), which perform feature selection by shrinking regression coefficients to zero in non-informative variables, seem to be the most promising methods to improve the weighting. Their efficiency in non-probability samples was highlighted by Chen et al. (2019), who showed that LASSO-weighted estimators have a lower RMSE than PSA-weighted equivalents.
The rest of this paper is organised as follows: Sect. 2 presents the essential aspects of calibration and PSA. The synthetic data and the real survey datasets used in our experiments are then described in Sect. 3. In Sect. 4 we describe the deployment of PSA models with a grid of classifiers and feature selection algorithms for the study data. The results of the experiments in terms of relative bias and efficiency are detailed in Sect. 5, after which the method proposed is applied in a real-world context concerning addiction and dependence, in Sect. 6. Finally, the implications of our findings are discussed in Sect. 7.

Calibration
Calibration was developed by Deville and Särndal (1992) as a reweighting method based on the availability of population totals for auxiliary variables measured in a sample, although some later versions addressed missing data situations or the use of dual frames for survey sampling (Ranalli et al. 2016). This adjustment is intended to reduce the coverage error between the target population and the sample, and takes the following form. Let x be a n × p matrix of p variables measured in a sample of size n, x i j is the value of the i-th individual in the j-th auxiliary variable, X = (X 1 , ..., X j , ..., X p ) are the known population totals for the auxiliary variables and d = (d 1 , ..., d i , ..., d n ) is the vector of design weights of the sample. If a probabilistic unbiased sample from the same population is available, estimated population totals can be used for X as an alternative (see Ferri-García and Rueda 2018 for a study of its efficiency). Calibration then attempts to obtain a new vector of weights w = w 1 , ..., w i , ..., w n by minimising their distance frp, d (from a class of distances leading to different estimators) subject to the calibration equations: n k=1 w k x k j = X j , j = 1, ..., p When information on population totals is incomplete, and especially when the crossclassification totals (also known as cell counts) are not known, it can be useful to use the raking ratio as defined in Deville et al. (1993), which takes advantage of the estimation of cell counts from the available data in the sample. Here, letN ab = k/x Ak =a,x Bk =b d k be the estimated cell count of ab, which represents the number of individuals whose measured value in the variables A and B is a and b respectively. The raking ratio uses this information to reformulate the calibration equations, thus obtaining the calibrated weights w k = d kN w ab /N ab , whereN w ab = d kNab represents the calibrated estimations of the cell counts. The efficiency of calibration procedures depends on the relevance of the auxiliary information in terms of relationship with the target variable and on the mechanism producing the coverage error. Calibration has also been found to be effective for removing selection bias when the target variable is not related to the selection mechanism (Bethlehem 2010; Rueda 2019).

Propensity Score Adjustment
Propensity Score Adjustment (PSA) was originally developed by Rosenbaum and Rubin (1983) as a technique for balancing comparison groups in nonrandomised studies, where the inclusion in one group or another might be driven by or associated with variables not controlled by the researchers. PSA was subsequently adapted to the context of online surveys (Taylor 2000;Taylor et al. 2001;Lee 2006;Castro-Martín et al. 2020a) as a means of reducing selection bias when a reference probability sample collected from the same target population is available. In this case, let s r be the reference sample, s v the nonprobability sample obtained from the online survey and s = s r ∪ s v . Furthermore, let R be a binary variable measured for U where R i = 1 if i ∈ s v and R i = 0 otherwise. PSA assumes that the inclusion probability or propensity score, π , for s v is conditional on a set of covariates, x, such that: The inclusion probability can therefore be modelled through a proxy of R. Let z be a binary variable measured for s which z i = 1 if i ∈ s v and z i = 0 if i ∈ s r . The propensity score is then estimated by predicting the values of z using a model M: Note that in this case we are not estimating π but π * , which is the propensity obtained when we predict the measured participation z rather than the true participation R. The propensity scores are used to reweight the nonprobability sample. In this process, inverse probability weighting formulas can be used, such as the simple inverse probability w P S AI PW 1 = 1/π (Valliant 2020) or the inverse probability allowing weights to be less than one, as proposed by Schonlau and Couper (2017): w P S AI PW 2 = (1 − π)/π. Propensities can also be transformed into weights using the subclassification methods proposed by Lee (2006) and Lee and Valliant (2009). This technique stratifies the vector of propensities into c parts (following Cochran (1968), c is usually taken as 5) with similar propensities, applying the formula: where d r , d v represent the design weights for the reference and volunteer samples respectively and s c r , s c v are the individuals belonging to the c-th strata of propensities in the reference and volunteer samples respectively. Valliant and Dever (2011) proposed a similar method, but instead of calculating a correction factor,the propensities in each stratum were averaged and then transformed into weights by inverse probability weighting, as follows:

Artificial data
An experiment with artificial data was performed to evaluate the benefits of feature selection under different conditions. In this experiment, a population U of size N = 500, 000 was generated with 17 variables: eight variables x = (x 1 , ..., x 8 ) were used as covariates for PSA algorithms, out of which variables x 1 , x 3 , x 5 and x 7 were used as calibration variables. Another eight variables y = (y 1 , ..., y 8 ) were considered as target variables and a variable π measured the probability of each individual of the population being selected in the nonprobability sample. The covariates were generated as described in Eq. 6. Four variables (x 1 , x 3 , x 5 , x 7 ) followed a Bernoulli distribution with p = 0.5 and the other four (x 2 , x 4 , x 6 , x 8 ) followed Normal distributions with a standard deviation of one and a mean parameter dependent on the value of the previous Bernoulli variable for each individual; for instance, if the i-th individual had a value of 1 in x 1 , then its value for x 2 was simulated according to a N (2, 1) distribution, and if it had a value of 0, then it was simulated according to a N (0, 1) distribution. This procedure induced a collinearity in the models if all of the covariates were used, an issue that could be addressed by variable selection algorithms.
The inclusion probability π was made dependent on x 5 , x 6 , x 7 and x 8 as described in Eq. 7, which allowed the experiment to cover Missing At Random (MAR) situations.
The target variables were simulated as described in Eqs. 8 to 15. Four types of relationship were considered: no relationship at all with any other variable (y 1 and y 2 ), a relationship with the selection mechanism (y 3 and y 4 ), a relationship with some covariates related to the selection mechanism (y 5 and y 6 ) and a relationship both with the selection mechanism and with some covariates (y 7 and y 8 ).
This procedure allowed the target variables to reflect all of the missing data mechanisms; y 1 and y 2 are examples of Missing Completely At Random (MCAR) data, where the outcome is not related to the selection. y 5 and y 6 are examples of Missing At Random (MAR) data, where the outcome is indirectly related to the selection through some variables. Finally, y 3 , y 4 , y 7 and y 8 are examples of Missing Not At Random (MNAR) data, where the outcome is directly related to the selection mechanism.

Real data
The experiment was then repeated using a real dataset as a pseudopopulation to examine whether variable selection algorithms might be helpful when more complex relationships are present in the data. The dataset was obtained by the January 2019 Barometer Survey (study number 3238) conducted by the Spanish Centre for Sociological Research (CIS, Spanish initials), a monthly survey that measures political and social opinions among the Spanish adult population (Spanish Center for Sociological Research (2019)). The original dataset of the survey sample made available by the CIS included n = 2989 individuals and p = 203 variables, out of which 17 variables were finally selected: -6 target variables: assessment of the current economic situation in Spain and in their own lives (binary, 1 if "bad" or "very bad", 0 otherwise), score on the ideological self-positioning scale (numeric, 1-10), assessment of the central government's performance (binary, 1 if "Poor" or "Very poor", 0 otherwise), territorial organisation preference (binary, 1 if "State with no autonomous structures", 0 otherwise) and national sentiment (binary, 1 if "Self identification as only Spanish", 0 otherwise). -10 variables to be used as covariates in PSA or calibration variables: frequency of attendance at religious acts, gender, age, education level, socioeconomic status, autonomous community of residence, size of the municipality of residence, nationality, marital status and degree to which voting is expected to change things.
Gender, age and size of the municipality were chosen as calibration variables in each simulation run, and were also included as potential covariates for PSA. -One variable, use of internet in the three months prior to the survey (1 if it was used, 0 otherwise), was taken as a delimiter of the population subset from which nonprobability samples would be drawn. Individuals with a value of 1, but not those with a value of 0, in this variable could belong to the nonprobability sample. The rationale for this delimiter is that it reproduces the conditions that apply in real online surveys, in which people with no internet access cannot be selected to participate.
The pseudopopulation was obtained by bootstrapping the original sample up to N = 500, 000 individuals through simple random sampling with replacement. Out of the 500,000 individuals, 404,174 (80.83% of the pseudopopulation) had used the internet in the three months prior to the survey. Despite internet's large penentration, the differences between the population with and without access to the internet are noticeable in several target variables, which leads to a remarkable amount of coverage bias when estimating population parameters using only people who had accessed the internet. This coverage bias can be treated using calibration in addition to PSA. Those differences can be observed in Table 1.
The pseudopopulation was obtained by bootstrapping the original sample up to N = 500, 000 individuals through simple random sampling with replacement. Prior to the bootstrapping, anyone who did not answer ("Does not know"/"Does not answer") any of the 17 items was excluded, as were the persons who answered "Other" for education level, or who gave "Ceuta" or "Melilla" as their autonomous community of residence. The reason for this filtering process was to remove highly uncommon classes that could produce inconsistencies in a simulated sample and provoke errors in the propensity scoring algorithms. Moreover, the education levels "No formal education" and "Primary education" were collapsed into a single class, while missing data in the variable concerning attendance at religious acts was taken as a new class (given that everyone in this group was considered to be atheist or agnostic). After the preprocessing, the sample size before bootstrapping was n = 2, 156.

Feature selection algorithms
Feature selection was performed prior to PSA to select those variables that are more relevant for the prediction of a target variable (which must be present in the dataset). In a model-based framework, we only have one variable of interest, y, for which we both predict its values and estimate its population parameters. However, in the designbased framework of PSA, we have two variables that must be considered: the indicator variable z (z i = 1 if i ∈ s v , z i = 0 otherwise), for which we predict the probability P(z = 1), and the target variable of the study, y, and its population parameters of interest that we want to estimate.

Table 1
Population values of the variables of interest in the real data simulation. All numbers are population proportions of the features of interest described in the "Target variable" column, except for "Ideological self-positioning scale (1-10)" where the numbers correspond to the population mean Target  Internet population: population who accessed the internet in the three months prior to the survey. Non-internet population: population who did not accessed the internet in the three months prior to the survey. Absolute difference: |Complete -Internet|. Relative difference: Complete/Internet -1 Given that the predictive model is applied on z in PSA, it would be fair to assume that the relevant variables for prediction should be selected considering z as the target variable in the feature selection algorithms. However, given the previous research on PSA for experimental designs and the fact that the bias must be removed for y (and not necessarily for z), it could also be reasonable to select those variables which are more relevant to y, and therefore to consider y as the target variable in the feature selection algorithms. In this work, we have considered both scenarios: feature selection for the prediction of z, and feature selection for the prediction of y. In the former case, the combination of both samples s v ∪ s r can be used, while in the latter case only s v can be used for feature selection, as y has only been measured for individuals in s v .
The following feature selection algorithms were used in the experiment, and their performance was compared to the use of all variables and to the use of the variables provided by Stepwise: -CFS (Correlation-based Feature Selection) filter with best first search. This algorithm, proposed by Hall (1999), searches the subset of variables which maximises the correlation with the target variable and minimises that between the variables of the subset. Thus, irrelevant and redundant features are discarded from the optimal subset of features for prediction. Note that Pearson's correlation is used to evaluate the relationships between the variables; if any of the variables within a pair is nonnumeric, it is binarised and each of the binary variables is then used separately. The simplicity of the algorithm makes it a fast and intuitive choice for finding the optimal subset of relevant covariates, while at the same time addressing the multicolinearity problems that may arise from focusing only in the correlation between the target variable and the covariates. On the other hand, the use of Pearson's correlation coefficient implies that nonlinear relationships or categorical variables with high cardinality might be erroneously discarded by the algorithm. -Chi-square filter. This approach calculates the Cramer's V value between the target variable and each independent variable, and so the user must define a cut-off point for selection. In our experiment, the cut-off point was the Cramer's V value with the biggest difference from the V of the next variable in importance (ordered from highest to lowest). This filter performs better at finding relationships between categorical variables rather than continuous ones, given that Cramer's V depends on the number of classes of each pair of variables, and therefore the coefficient might be considerably more sensitive towards continuous variables. -Gain ratio. This entropy-based filter (Quinlan 1986) is calculated by dividing the information gain by the entropy of the target variable. The information gain is measured as the difference between the sum of the entropies of the independent and the target variables and the entropy of the target variable after introducing the independent variable into the predictive model (defined as a decision tree). The gain ratio, thus, is a relative continuous measure of the predictive performance of a variable. The cut-off point was the gain ratio's value with the biggest difference from the gain ratio of the next variable in importance (ordered from highest to lowest). Gain ratio, as the rest of entropy-based filters, is assumed to identify nonlinear relationships more precisely than correlation coefficients, but require discretization of continuous variables in order to seize all its potential (Yu and Liu 2003). -One-R. This algorithm, developed by Holte (1993), is based on very simple rules of association, by which each independent variable is tabulated with the target variable. The number of errors is then determined and interpreted such that higher values represent a stronger predictive power. OneR automatically divides continuous variables into categories using discretization functions, hence it is a more suitable filter in datasets with categorical and continuous variables. However, the algorithm is prone to overfitting if the discretization aims to obtain "pure" classes where all individuals take the same values. -Random Forest importance filter. This algorithm computes the mean importance value across the trees created in a Random Forest model (Breiman 2001) for each independent variable. In our experiment, the importance value taken was the mean decrease in accuracy when the variable was discarded from the Random Forest model. The cut-off point was the importance value with the biggest difference from the importance value of the next variable (ordered from highest to lowest according to their importance value). This algorithm is suitable for any kind of target variable and covariates, and its bagging configuration can be advantageous in more complex situations. In addition, the use of the mean decrease in accuracy instead of node impurity (measured with the Gini index) avoids the overestimation of the importance value of continuous attributes. However, the method is still sensitive to multicolinearity problems (Nicodemus et al. 2010). -Boruta algorithm. This algorithm is based on the Random Forest importance measure, but it considers a set of non-informative variables created from the random shuffling of each independent variable included in the model. As a result, the algorithm selects the variables that have greater importance than non-informative variables. To obtain statistically valid results, the procedure is repeated until every variable has been deemed as "important" or "unimportant". Further details on this algorithm can be consulted in Kursa and Rudnicki (2010). This algorithm can be considered an improved version of the Random Forest importance filter and has the advantage of automatically selecting the relevant variables and discarding the irrelevant ones. However, the computational costs of the algorithm are large. -LASSO regression (Tibshirani 1996). This regression model performs a variable selection based on introducing penalisation terms into the Ordinary Least Squares equations. As a result, a regression model is provided but only the variables selected have non-zero coefficients. In the present study, we take advantage of the LASSO variable selection technique by extracting the variables with non-zero coefficients and using them as inputs for the propensity estimation models. When all the coefficients of the LASSO model are zero, no PSA is performed and therefore the weights remain unitary. The LASSO algorithm has provided better results than other predictive methods in nonprobability sampling contexts (Chen et al. 2019;Castro-Martín et al. 2020a), but the variable selection is performed considering a specific model and optimization criteria. The advantages on the use of subsets of relevant variables according to LASSO as input variables for other predictive algorithms are unclear.

Estimation with Propensity Score Adjustment and calibration
Once the optimal subset of variables had been selected, the Propensity Score Adjustment (PSA) was performed. As well as logistic regression (LR), the standard algorithm in PSA, several other algorithms were also tested for propensity estimation, namely: k-Nearest Neighbours (kNN), Gradient Boosting Machine (GBM) and feed-forward neural networks (NN). Parameter tuning was performed for these three algorithms. Ten-fold cross-validation was applied to the model, predicting z prior to PSA; the following parameter grids were used for each algorithm: The choice of the kNN and GBM algorithms is based on their performance in the previous study developed in Ferri-García and Rueda (2020), where they showed a better performance than logistic regression in some situations in terms of bias and MSE. The use of neural networks is based on providing more diversity in the approaches, and the possible predictive advantage than neural networks may provide in modeling (Breidt and Opsomer 2017). k-Nearest Neighbors is a simple algorithm that can provide good results when the number of covariates is low, while its performance decreases in contexts of high dimensionality. For this reason, variable selection might be highly recommendable to boost kNN performance. Gradient Boosting Machines are better in those high-dimensionality situations because of their boosting algorithm that is able to internally select the best predictors.

Experiment settings
In both scenarios, the same procedure was followed to measure the effects of variable selection in PSA and calibration on the estimation from nonprobability samples. This procedure, repeated across 400 simulation runs for each dataset (artificial and real), can be sequentially described as follows: 1. Two samples of size n = 1,000 are drawn. The first one, s r , is the probability sample and is drawn by simple random sampling without replacement (SRSWOR) from the full population. The second sample, s v , is the nonprobability sample and is drawn according to the following schemes: -Artificial dataset: unequal probability sampling where π is the vector of inclusion probabilities, calculated as described in Equation 7. -Real dataset (two schemes): -SRSWOR from the subset of the population who had accessed the internet during the three months prior to the survey. -Unequal probability sampling from the subset of the population who had accessed the internet during the three months prior to the survey, with inclusion probabilities proportional to the age: where U I is the subset of the pseudopopulation who used the Internet in the three months prior to the survey.
2. Propensity of belonging to s v is estimated with PSA, using the variable selection algorithms described in Sect. 4.1 to select the input covariates for propensity prediction models, and the four choices of algorithms described in Sect. 4.2 to model propensities. We also consider the choice where no variable selection algorithm is applied and all covariates are included in the models. 3. Estimated propensities are transformed into weights using the inverse probability weighting formula w i = 1/π i . 4. Weights are used to estimate the population mean of each target variable with and without applying Raking calibration, on which the propensity weights w obtained in step 3 are used as initial weights.
The resulting 400 estimates of the population mean for each combination of methods are subsequently used to obtain the relative bias of a given combination of methods: where Y is the population mean of the target variable, andŷ i is the estimate of the population mean of the i-th simulation obtained after applying bias reduction methods. Together with the relative bias, the efficiency of each variable selection method with respect to the case in which all variables are used is also shown, given a propensity model m (Log. reg., GBM, kNN or NN), a Raking calibration choice (yes or no) r , and a choice for the target variable (exposure or outcome) in selection algorithms v: where k = {Boruta, CFS, Chi-squared, Gain ratio, LASSO, StepWise, OneR, Random Forest importance} is the variable selection algorithm and MSE is the Mean Squared Error observed for the combination of methods: An effect greater than 1 means that the use of the variable selection method k is inefficient in comparison with using all covariates, while if it remains below 1 the selector k provides more efficient estimates, provided all other adjustments remain equal. This "Effect" can be seen as the MSE of a certain variable selection method in relation to the reference case in which all variables are used. The statistical significance of each effect was tested using bootstrapping techniques. Basic resampling (with 1000 replications) was performed to obtain each effect number, so the standard deviation of the effect could be estimated from the bootstrapped samples. The standard deviation was used to perform t-tests on the following null hypothesis: The t-tests used the standard deviation calculated from the bootstrap procedure, and the confidence level was fixed at 95% for all effects. Rejection of the null hypothesis would mean that there are statistical evidences that the variable selection method k provides more efficient estimates given that the rest of conditions remain unchanged. Resampling was performed using the resample package available in R (Hesterberg 2015).
Finally, for each feature selection algorithm and Raking choice (Raking used or not used after PSA), we computed the estimated mean and median relative bias and effect. For relative bias, we also computed the number of times that the estimates provided by a feature selection algorithm have been among the best (Relative Bias less than 1% greater than the minimum) conditional to a given variable of interest, target variable in the feature selection algorith, PSA predictive algorithm and Raking strategy. For the effect, we computed the number (and percentage) of times that the effect has been below 1 (the MSE after applying a given feature selection algorithm was lower than the MSE using all variables) and below 0.9 (the MSE after applying a given feature selection algorithm was more than 10% lower than the MSE using all variables).

Artificial data
The relative bias results obtained in the simulation with artificial data are shown in Tables 8 and 9. For the MCAR variable y 1 , variable selection was useful when neural nets were used as the predictive model and Raking was applied after PSA, although the improvements were not dramatic. The least biased estimates were provided by PSA with kNN using all variables in y 1 and variables selected by OneR (selecting on the variable of interest) with neural nets and no Raking in y 2 , although this result was closely followed by the Gain ratio score in the latter case. However, the differences are too small to be considered relevant.
With the MAR variables (y 5 and y 6 ), Raking calibration markedly reduced the bias in the estimates. Regarding variable selection, almost all the methods in y 5 and some of them in y 6 reduced the bias when the predictive model was logis-tic regression, although some reductions were also observed when other methods were applied in different models. In the case of y 6 , the chi-square filter, the Gain Ratio and Random Forest all reduced the bias from 2.88 (when using all available covariates) to 2.01 in y 2 if logistic regression and Raking calibration were applied.
Finally, in NMAR situations (y 3 , y 4 , y 7 and y 8 ), the application of Raking calibration also reduced bias but not as much as for MAR variables. For y 3 and y 8 , the best choice for the target variable in the selection algorithms was the variable of interest (y), while fixing the target variable in the indicator variable of inclusion in s v (z) provided better results in y 4 . The largest reductions in bias in y 3 were obtained with the LASSO algorithm, although CFS, Chi-square and the Gain Ratio also worked well when combined with Raking.
The effect of each variable selection method in comparison to using all variables, if the rest of methods remain equal, is detailed in Tables 10 and 11. These results are in line with those for relative bias in each case, although they reflect some improvement in a much larger set of situations. With the MCAR variables (y 1 and y 2 ), MSE could be slightly reduced using variable selection in some cases, but in general the effect improvements were small. Reductions were always below 10% of the MSE using all variables, and only 7.03% of the efficiencies were below 0.95 (this is, reductions above 5% of the baseline MSE), all of them achieved selecting variables with regard to the variable of interest.
Regarding the MAR variables (y 5 and y 6 ), very noticeable improvements in efficiency were obtained in the estimation of y 6 . When Raking calibration was applied, the use of the Chi-square filter, Gain Ratio or Random Forest reduced MSE by 11% (if k-NN was used in PSA) to 50% (if LR was used in PSA), when they selected variables using the variable of interest as the target. Reductions in MSE with the same variable selection methods were also observed when Raking was not applied. Other methods, too, provided larger effect values when y 5 was estimated for the cases in which logistic regression was used to estimate the propensities. It is worth noting that the hypothesis of greater efficiency of selection methods was accepted in several cases, which gives some evidence that variable selection methods can work in practice.
Finally, regarding the NMAR variables, the reductions in MSE were noticeable: 12.5%, 11.7% and 24.2% of the times the effect was below 0.9 (improvements in the MSE above 10%) in the estimation of y 3 , y 7 and y 8 respectively. In the estimation of y 3 , all the combinations of methods, except for those involving Boruta selection algorithm, provided efficiencies significantly lower than 1 (with a mean effect of 0.922) when selecting variables using the variable of interest as the target in feature selection algorithms. The opposite situation was observed in y 4 : selecting variables using the indicator variable of inclusion in s v as the target provided better results, although the improvements were considerably lower despite many of them being statistically significant. In y 7 , feature selection algorithms provided more efficient estimates mainly in those cases where logistic regression was used, although OneR (selecting on the variable of interest) provided gains when using other algorithms for PSA in the cases where Raking was not applied. For y 8 , there are statistical evidences that all of the feature selection algorithms provide more efficient estimates (except for StepWise and LASSO) when the selection is done using the variable of interest as the target, regardless of the classifier used in PSA or the further use of Raking calibration.
A summary of the relative bias and effect results observed in the artificial data simulation can be found in Table 2. When Raking calibration is not applied, results on relative bias are very similar across variable selection strategies, although the most featured method among the best ones is the strategy of using all variables. When Raking calibration is applied, the situation slightly changes, with the gain ratio score appearing the most in the set of algorithms that provide the best results, followed by the OneR algorithm. These results can be explained by the fact that each Bernoulli covariate is related to a Gaussian covariate, but the different Bernoulli and Gaussian covariates are independent of each other. This scenario can be more suitable for simple algorithms that test one variable at a time, such as OneR, because it is only necessary to retain one of the two variables that compose each Bernoulli-Gaussian relationship, while in more complex scenarios the choice would not be that clear. Regarding the effect, none of the variable selection algorithms seem to outperform the case where all variables are used when Raking calibration has not been applied. If it has been applied, there is a certain evidence towards the effectiveness of variable selection, with all of the algorithms except for StepWise and Boruta achieving lower MSE values than the case where all variables are used more than a half of the times they are included in the preprocessing step.

Real data
The relative bias results obtained by each combination of methods in the simulation using CIS data and considering SRSWOR from the Internet population to obtain the nonprobability samples are listed in Table 12. Interestingly, the best choice in variable selection differed according to the propensity estimation model considered. For example, PSA using k-NN provided the best results when using all the available covariates, except for the variable measuring central government performance. In the remaining cases, the use of certain variable selection algorithms was associated with a decrease in relative bias. This was especially apparent for the variables measuring the economic situation in Spain, central government management and the preference for a unitary national state without autonomous communities. In these cases, the largest reductions in relative bias (compared to the case in which all variables were used) were obtained when the variable selection algorithms used the variable of interest as the target variable. Raking calibration had a modest positive effect on the variables measuring the ideological self-positioning scale, the preference for a unitary national state without autonomous communities and whether the respondent self identified as only Spanish, while its impact on relative bias in the other variables was non-relevant or negative. The efficiency of each variable selection algorithm for a given combination of adjustments (propensity model, use of calibration and target variable choice for selection), in comparison with the case in which all variables are used, is shown in Table 13. For all variables, one or more selection algorithms increased the efficiency, in comparison Table 2 Estimated mean and median of RB and Effect of estimates using PSA for each algorithm, and number of times its estimates have been among the best (RB less than 1% greater than the minimum) or have been more efficient than using all variables (Effect under 1 and under 0.9) in the artificial data simulation Use of with the case in which all variables were used. The estimated mean and median effect of all the possible combinations of methods shown in Table 13 is below 1 for all the target variables except for the ideological self-positioning scale. However, the numbers are strongly dependent on the target fixed in the feature selection algorithm. For instance, the percentage of combinations with an effect below 0.9 (decrease of more than 10% in the MSE in comparison to the case where all covariates are used) when fixing the variable measuring the inclusion in s v (z) as the target is 0% when estimating the economic situation in Spain, central government management and the preference for a unitary national state without autonomous communities, but if the variable of interest is fixed as the target in feature selection algorithms, the percentage rises up to 12.5%, 29.7% and 20.3% respectively. On the other hand, feature selection provides better results in the estimation of ideological self-positioning scale scores when fixing the variable measuring the inclusion in s v (z) as the target (estimated median effect: 0.942; percentage of combinations with an effect below 0.9: 20.3%). Some statistical evidences can be observed regarding the effectiveness of several methods, such as the chi-square filter, LASSO and OneR for PSA with logistic regression when estimating the economic situation in Spain, or CFS and Gain ratio when estimating the ideological self-positioning scale mean score, among other examples. A summary of the relative bias and effect results can be found in Table 3. When Raking calibration is not applied, results on relative bias are very similar across variable selection strategies, although the most featured method among the best ones is CFS, followed by using all variables. When Raking calibration is applied, the situation slightly changes, with OneR appearing the most in the set of algorithms that provide the best results, followed by CFS. Regarding effect, it can be observed that all the variable selection algorithms (except for StepWise and Boruta when applying Raking calibration) provide more efficient estimates in more than half of the cases, a percentage that goes above 60% and even 80% of the cases for CFS and OneR when Raking is applied. In both situations mentioned, along with the case of chi-square filter, the percentage of cases where the effect is below 0.9 (reduction of the MSE above 10% in comparison to the case where all covariates are used) is almost 20% (18.8%). These results, along with the statistical evidence observed in hypothesis testing, suggest an advantage of variable selection methods in comparison to the use of all the available covariates.
The relative bias results obtained by each combination of methods in the simulation using CIS data and considering probabilities proportional to the age to obtain the nonprobability samples from the Internet population are listed in Table 14. It is worth noting that the behavior of relative bias changes for some variables; the non-raked estimates of the personal economic situation are less biased than the case where the nonprobability sample is obtained via SRSWOR from the Internet population, but more biased when estimating the rest of the variables of interest. On the other hand, feature selection algorithms offer a very similar performance to the previous case, providing less biased estimates in a variety of scenarios. The largest reductions in relative bias (compared to the case in which all variables were used) were again obtained when the variable selection algorithms used the variable of interest as the target variable, especially (but not exclusively) if Raking calibration was applied. The effect of Table 3 Estimated mean and median of Relative Bias and Effect of estimates using PSA for each algorithm, and number of times its estimates have been among the best (Relative Bias less than 1% greater than the minimum) or have been more efficient than using all variables (Effect under 1 and under 0.9) in the real (bootstrapped) Table 15. In general, the effect was noticeably lower in this scenario in comparison to the previous one (SRSWOR from the Internet population to obtain the nonprobability samples). The percentage of combinations that provided effects below 0.9 was more than 10% in the estimation all the variables of interest, and the estimated median was below 1 except for ideological self-positioning scale and feeling only Spanish variables. In those cases, it can be shown that the effect is largely different depending on the choice for the target variable in the feature selection algorithm; when estimating ideological self-positioning scale, it is better to fix the indicator variable of inclusion in s v as the target (estimated median effect: 0.933 against 1.37 when using the variable of interest), and vice versa for feeling only Spanish (estimated median effect: 0.990 against 1.02 when using the indicator variable of inclusion in s v ). In addition, the number of statistically significant results for the effect is large, up to the point that for each variable of interest and Raking choice (except personal economic situation with no Raking, and feeling only Spanish with Raking) there is a feature selection algorithm that has a positive effect on estimators' efficiency. The aforementioned results are summarized in Table 4. When Raking calibration is not applied, the estimated mean and median relative bias observed is smaller for certain feature selection methods, more precisely: chi-square filter, OneR and Random Forest importance. These algorithms are also the ones that appear the most among the best approaches for feature selection (in terms of relative bias). The high performance of these algorithms remains in the case where Raking calibration is applied. Regarding effect, it can be noticed that chi-squared filter, OneR, Gain ratio and Random Forest importance are the ones that provide the best results, providing efficient estimates around 70% of the times or even more than 80% sometimes, while other algorithms also seem to offer a good performance, such as CFS. It is particularly relevant to observe that the effect on the estimates of using chi-square and OneR algorithms was below 0.9 more than 40% of the times when Raking calibration was used.

Application study
This section presents an application of variable selection for PSA in a real-world context, to estimate the population mean of two variables using a probability and a nonprobability sample. The application takes place within a study on abuse and dependence in a population of university students. The probability sample used as the reference sample was obtained from a survey conducted in 2015,targeting students at the University of Granada (UGR), Spain. The sample was composed of n r = 856 respondents, recruited in faceto-face interviews under a three-stage cluster sampling design, which produced an estimated sampling error of ±3.3% in the case of p = q = 0.5 with a confidence level of 95%. The survey questionnaire included screening instruments for abuse and dependence, namely the Spanish Mobile Phone Abuse Questionnaire (ATeMo) (Olivencia-Carrión et al. 2018), which provides a score between 0 and 100 points that reflects the level of mobile phone abuse of the participant. The ATeMo Table 4 Estimated mean and median of Relative Bias and Effect of estimates using PSA for each algorithm, and number of times its estimates have been among the best (Relative Bias less than 1% greater than the minimum) or have been more efficient than using all variables (Effect under 1 and under 0.9) in the real (bootstrapped) data simulation considering inclusion probabilities proportional to age in the nonprobability sample  (Legleye et al. 2007) and the Severity of Dependence Scale (SDS) (Gossop et al. 1995), together with subscales regarding internet and videogames addiction from the MULTICAGE-CAD4 instrument (Pedrero-Pérez et al. 2007). The survey also recorded the age, gender and university faculty of each participant.
The nonprobability sample was derived from a survey completed by selfselected respondents, conducted in January 2018 and also targeting UGR students. The sample was composed of n v = 176 respondents, who were recruited via snowball sampling performed by the students themselves. All of the variables included in this survey were measured in the reference sample. However, some data preprocessing was performed prior to the analysis; four respondents were ruled out because they were under 18 years old, as were another 43, who left more than 85% of the questionnaire items unanswered or who left blank all of the items of any of the scales. The final sample size, therefore, was n v = 129 individuals. Missing data present in the sample was imputed using the Classification and Regression Trees (CART) algorithm (Breiman et al. 1984).
Age, gender and faculty were used as calibration variables in Raking, as the population totals (but not the cross-probabilities) were available. The covariates eligible for PSA were the total score for the CAST and SDS scales,the MULTICAGE subscales (internet and videogames), and the variables used for calibration: age, gender, and faculty. In total, seven variables were eligible for propensity modelling. The two variables of interest were present in both samples; this is not a feasible situation in real-world applications of PSA (the target variable would not be available in the probability sample) but in this case it allowed us to compare the estimations from both samples. These variables were: -Mean score on the total ATeMo scale, which was 30.066 units in the reference sample (with a sample standard deviation of 15 units) and 32.558 units in the unweighted convenience sample (with a sample standard deviation of 13.99 units). -Mean score on the item "I have tried to spend less time using my mobile phone but I cannot do it" (number 16 in the ATeMo instrument). The mean score of this item in the reference sample was 0.776 (with a sample standard deviation of 1.002) while in the unweighted convenience sample it was 1.217 (with a sample standard deviation of 1.132), this being the greatest difference observed in in any ATeMo item between the reference sample and the convenience sample. Table 5 shows the distributions of the covariates available for PSA in both samples. Except for gender, the values differ greatly in the distribution of the covariates between the two samples. Overall, respondents to the online sample were younger and more prone to cannabis consumption. In addition, their score for the MUL-TICAGE subscales of internet and videogames addiction tended to be higher than those of the reference sample members. Finally, the Science Faculty at the UGR was clearly overrepresented in the online sample, as to a lesser extent was the Medicine Faculty, while the other faculties were underrepresented. Given that the variability between samples can be identified in the covariates, it seems likely that PSA might be helpful to obtain more efficient estimates out of the online sample.
Estimation of the population means followed the same procedure as described in Sect. 4.3: each algorithm for variable selection was applied before PSA (with the same predictive models -and hyperparameter optimisation-as in the simulations: logistic regression, GBM, k-NN and neural networks) using the described reference and convenience samples, and the resulting weights were used directly in the estimators or as initial weights for Raking calibration. The estimated population means for each combination of methods and the estimated Leave-One-Out jackknife variance (Quenouille 1956) are shown in Tables 6 and 7 respectively.
In all cases, the use of variable selection algorithms made the estimates closer to the value observed in the reference sample. For estimation of Item number 16, selecting variables that set the indicator variable of inclusion in s v (z) as the target variable gave subsets that provided the closest estimates for each predictive algorithm, while for the ATeMo score the best choice was to set the variable of interest (y) as the target in the variable selection algorithms. Raking calibration also helped provide estimates that were closer to the reference sample one, especially in the case of Item number 16. On the other hand, the application of these methods increased the variance of the estimator, although in general this increase was greater when any variable selection algorithm was used (with some exceptions).

Discussion and conclusions
In propensity estimation models for online surveys, the question of the variables to be included has been widely discussed, and in some cases questions have been included specifically to distinguish between the potentially covered population and target population individuals ). Informative variables can be selected by the practitioner prior to the study, especially when there is some knowledge on the relationships between variables. However, there is often no information at all on the relationships present in the variables prior to the study, and this circumstance is even more likely in high dimensional contexts, which are becoming ever-more frequent with the development of Big Data methods in survey sampling.
In such cases, variable or feature selection algorithms may contribute to identifying the most informative subset of variables. The simulations performed in our study, using synthetic data and a real survey, reveal the impact of variable selection. In building the models, we also considered machine learning classification algorithms and the subsequent application of Raking calibration, in order to determine which alternatives are most effective in terms of bias removal.
Our analysis shows that feature selection makes a significant contribution to reducing relative bias. However, the best feature selection algorithm, in this respect  and regarding its effect on the estimation, varies according to the dataset considered and the adjustment choices made. The best variable selection method depends on the dataset, meaning there is no one-size-fits-all solution. However, the reduction of model complexity associated with variable selection consistently produced more efficient estimators. As expected, selecting variables according to their impact on the outcome variable provided the best results overall. In line with Austin and Stuart (2015), we find that the propensity score works on the covariates included in the model, so it is preferable to include prognostically important variables (related to the outcome) as the probability to mitigate the bias in the estimation of the target variable will also be higher. In view of these results, in practice the combination of several variable selection approaches, rather than just one, might be useful to identify the best subset in each situation.
Regarding other adjustment methods, Raking calibration after PSA proved to be the most efficient technique in almost all cases. The redundancy of variables between adjustments can reduce the efficiency of their combination in some cases, as observed by Lee and Valliant (2009), who reported that the use of the same variables for PSA and calibration resulted in estimates which, despite being less biased than estimates using only PSA, underperformed versus adjustments with no redundancy.
On the other hand, the use of classification algorithms instead of logistic regression for estimating propensities was advantageous overall, but only for certain algorithms and with no clear view as to which was the best algorithm for estimation. The application of this sort of algorithm in nonprobability sampling was recently studied by Buelens et al. (2018) as an option for model-based estimates, and by Castro-Martín et al. (2020b), Ferri-García and  and Ferri-Garca et al. (2020) for PSA in online surveys. It has also been studied for PSA in nonresponse adjustment (Phipps and Toth 2012;Buskirk and Kolenikov 2015), with promising results. Further studies should take into account this approach, together with the use of a wider range of algorithms, and should consider how preprocessing (such as the feature selection applied in the present study) might influence their performance in propensity estimation.
Further research is needed regarding the implications of variable selection on nonprobability samples, as our study presents certain limitations. Most importantly, relatively few covariates were available for each simulation and for the application study. Originally, feature selection algorithms were intended to reduce dimensionality in large data sets, facilitating the selection of only the most significant variables for prediction. Further research into these algorithms in PSA for selection bias treatment using a larger number of covariates would enhance our understanding of these questions. However, our results also support their use in a low dimensional context, meaning that the value of these algorithms could extend beyond computing optimisation. For example, the use of variable selection algorithms could be extended to calibration; although research has shown their potential and some methods have been developed in this area (Chen et al. 2019), further study is needed to consider this topic, as calibration requires little information and therefore can be more widely applied. Finally, the use of more powerful algorithms for propensity estimation, such as deep learning techniques, should be considered in future studies, as these methods usually involve automatic variable selection and could provide more precise estimates.

Conflict of interest
The authors declare that they have no conflict of interest.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.   The closer to zero a value, the less biased the mean estimate obtained Table 9 Mean Relative Bias of the estimates of population means for variables y 5 , y 6 , y 7 , y 8 in the artificial data simulation for each combination of methods The closer to zero a value, the less biased the mean estimate obtained

Table 10
Effect of the estimates of population means for variables y 1 , y 2 , y 3 , y 4 in the artificial data simulation for each combination of methods   Values greater than one indicate inefficiency, while values below one show that the use of a given variable selection method provides more efficient estimates than the case in which all variables are used. Values in bold indicate that the hypothesis of the effect being equal or greater than 1 can be rejected (α = 0.05) for that combination of methods *Reject the null hypothesis that the effect is equal or greater than 1 for this combination of methods (α = 0.05)

Table 11
Effect of the estimates of population means for variables y 5 , y 6 , y 7 , y 8 in the artificial data simulation for each combination of methods The closer to zero a value, the less biased the mean estimate  Values greater than one indicate inefficiency, while values below one suggest that the use of a given variable selection method provides more efficient estimates than the case in which all variables are used. Values in bold indicate that the hypothesis of the effect being equal or greater than 1 can be rejected (α = 0.05) for that combination of methods *Reject the null hypothesis that the effect is equal or greater than 1 for this combination of methods (α = 0.05)