Abstract
The paper deals with the scenario where some covariates are observed by design for a subset of the observations only. In the example treated in the paper this occurs with a two-phase sampling scheme, where in the first phase a relatively large sample is drawn to record a response variable Y and a set of (cheap) covariates x. In the second phase a smaller sample is drawn from the first-phase sample, for which additional (usually expensive) covariates z are also recorded. The second phase can be drawn with unequal probability sampling, where the sampling weights depend on the observed Y and x. The overall intention is to fit a regression model of Y on both x and z. Due to the design of the data collection we are faced with missing values for z for a majority of observations. We propose an approximate estimation approach using semi-parametric mean and variance regression of Y on x only and augment this fit with a full regression model of Y on x and z. The idea extends the approach of Little (1992) towards non-normal data and non-linear models. The proposed estimation is numerically rather simple and performs convincingly well in simulation studies compared to alternatives such as complete-case and multiple imputation analysis.
References
Anderson TW (1957) Maximum likelihood estimates for a multivariate normal distribution when some observations are missing. J Am Stat Assoc 52(278):200–203
de Boor C (1972) On calculating with B-splines. J Approx Theory 6(1):50–62
Carpenter JR, Kenward M (2013) Multiple imputation and its applications, 1st edn. Wiley, Chichester
Deville JC, Tillé Y (1998) Unequal probability sampling without replacement through a splitting method. Biometrika 85(1):89–101
Donders ART, van der Heijden GJMG, Stijnen T, Moons KGM (2006) Review: a gentle introduction to imputation of missing values. J Clin Epidemiol 59(10):1087–1091
Eilers PHC, Marx BD (1996) Flexible smoothing with B-splines and penalties. Stat Sci 11(2):89–121
Fahrmeir L, Gieger C, Klinger A (1998) Econometrics in theory and practice. Physica-Verlag, Heidelberg
Fitzenberger B, Fuchs B (2017) The residency discount for rents in Germany and the tenancy law reform act 2001: evidence from quantile regressions. German Econ Rev 18(2):212–236
Hanif M, Brewer KRW (1980) Sampling with unequal probabilities without replacement: a review. Int Stat Rev 48(3):317–335
Hayati Rezvan P, Lee KJ, Simpson JA (2015) The rise of multiple imputation: a review of the reporting and implementation of the method in medical research. BMC Med Res Methodol 15:30
Horton NJ, Kleinman KP (2007) Much ado about nothing: a comparison of missing data methods and software to fit incomplete data regression models. Am Stat 61(1):79–90
Ibrahim JG, Chen MH, Lipsitz SR, Herring AH (2005) Missing data methods for generalized linear models: a comparative review. J Am Stat Assoc 100(469):332–346
Lawless JF, Kalbfleisch JD, Wild CJ (1999) Semiparametric methods for response selective and missing data problems in regression. J R Stat Soc Ser B 61(2):413–438
Liang H (2008) Generalized partially linear models with missing covariates. J Multivar Anal 99(5):880–895
Liang H, Wang S, Robins JM, Carroll RJ (2004) Estimation in partially linear models with missing covariates. J Am Stat Assoc 99(466):357–367
Little RJA (1992) Regression with missing X’s: a review. J Am Stat Assoc 87(420):1227–1237
Little R, An H (2004) Robust likelihood-based analysis of multivariate data with missing values. Stat Sin 14(3):949–968
Little RJA, Rubin DB (2002) Statistical analysis with missing data, 2nd edn. Wiley, New York
Lumley T (2017) Robustness of semiparametric efficiency in nearly-true models for two-phase samples. arXiv:1707.05924
Mandallaz D, Breschan J, Hill A (2013) New regression estimators in forest inventory with two phase sampling and partially exhaustive information: a design based Monte Carlo approach with applications to small area estimation. Can J For Res 43(11):1023–1031
McLeish DL, Struthers CA (2006) Estimation of regression parameters in missing data problems. Can J Stat 34(2):233–259
Meng XL (2000) Missing data: dial m for ??? J Am Stat Assoc 95(452):1325–1330
Mitra R, Reiter JP (2016) A comparison of two methods of estimating propensity scores after multiple imputation. Stat Methods Med Res 25(1):188–204
O’Sullivan F (1986) A statistical perspective on ill-posed inverse problems. Stat Sci 1(4):502–518
Qin G, Zhu Z, Fung WK (2012) Robust estimation of the generalised partial linear model with missing covariates. J Nonparametric Stat 24(2):517–530
Robins JM, Rotnitzky A, Zhao LP (1994) Estimation of regression coefficients when some regressors are not always observed. J Am Stat Assoc 89(427):846–866
Robins JM, Rotnitzky A, Zhao LP (1995) Analysis of semiparametric regression models for repeated outcomes in the presence of missing data. J Am Stat Assoc 90(429):106–121
Ruppert D, Wand MP, Carroll RJ (2003) Semiparametric regression. Cambridge University Press, Cambridge
Ruppert D, Wand MP, Carroll RJ (2009) Semiparametric regression during 2003–2007. Electron J Stat 3:1193–1256
Saegusa T (2014) Bootstrapping two-phase sampling. e-print https://arxiv.org/abs/1406.5580v1
Saegusa T (2015) Variance estimation under two phase sampling. Scand J Stat 42(4):1078–1091
Stasinopoulos DM, Rigby RA, Heller GZ, Voudouris V, De Bastiani F (2017) Flexible regression and smoothing: using GAMLSS in R. Chapman and Hall/CRC, Boca Raton
Thompson SK (2012) Sampling, 3rd edn. Wiley, New York
Tillé Y (1996) An elimination procedure of unequal probability sampling without replacement. Biometrika 83(1):238–241
Tillé Y (2006) Sampling algorithms. Springer, New York
Tillé Y, Matei A (2016) The R package sampling. The comprehensive R archive network. http://cran.r-project.org/
Toutenburg H, Nittner T (2002) Linear regression models with incomplete categorical covariates. Comput Stat 17:215–232
van Buuren S, Groothuis-Oudshoorn K (2011) mice: multivariate imputation by chained equations in R. J Stat Softw 45(3):1–67
Wand MP (2003) Smoothing and mixed models. Comput Stat 18(2):223–249
Wang QH (2009) Statistical estimation in partial linear models with covariate data missing at random. Ann Inst Stat Math 61(1):47–84
Wood SN (2017) Generalized additive models—an introduction with R, 2nd edn. CRC Press, Boca Raton
Yang S, Kim JK (2016) Fractional imputation in survey sampling: a comparative review. Stat Sci 31(3):415–432
Zhang G, Little R (2009) Extensions of the penalized spline of propensity prediction method of imputation. Biometrics 65(3):911–918
Zhang Z, Rockette HE (2005) On maximum likelihood estimation in parametric regression with missing covariates. J Stat Plan Inference 134(1):206–223
Zhang N, Chen H, Elliott M (2016) Nonrespondent subsample multiple imputation in two-phase sampling for nonresponse. J Off Stat 32(3):769–785
Zhao Y, Lawless JF, McLeish DL (2009) Likelihood methods for regression models with expensive variables missing by design. Biom J 51(1):123–136
Acknowledgements
Mehboob Ali acknowledges financial support provided by Punjab Higher Education Commission for finishing his dissertation at LMU Munich.
Appendices
Appendix A: Penalized spline smoothing
Penalized spline smoothing is a very general and numerically stable routine for fitting smooth functions. We refer to Ruppert et al. (2003, 2009) for an extensive discussion of the field. In the following we sketch the basic ideas. The main principle is to replace the smooth function m(x) in model (7) and the smooth functions \(m_1(x)\) and \(\sigma ^2_1(x)\) in model (8) by spline basis representations. That is, we set \(m(x) = B(x)u\), \(m_1(x) = B(x)u_1\) and \(\sigma ^2_1(x) = \exp (B_\sigma (x) u_\sigma )\), where B(x) and \(B_\sigma (x)\) are spline bases; in principle we can set \(B(x) = B_\sigma (x)\). A convenient choice is a B-spline basis (see de Boor 1972), which is constructed from piecewise polynomial functions tied together in a continuous (and, where necessary, differentiable) way. This makes the whole model parametric, with the spline coefficients u in model (7) and the coefficients \(u_1\) and \(u_\sigma \) in model (8) as the parameters to be estimated. Since B(x) is chosen as a high-dimensional basis, the coefficient vectors are high dimensional as well. Unpenalized estimation would therefore exhibit large variability, which is why Eilers and Marx (1996) proposed to impose a penalty on u, e.g. requiring that neighboring coefficients do not differ very much. Such a penalty can be written as a quadratic form \(\lambda u^t D u\) for an appropriately chosen penalty matrix D. This leads to the penalized likelihood
\[ l_p(\theta , \lambda ) = l(\theta ) - \frac{1}{2} \lambda u^t D u, \qquad \text{(A1)} \]
where \(l(\theta )\) denotes the log-likelihood and \(\theta \) is the parameter vector of the model, which also contains the coefficient vector u. The parameter \(\lambda \) plays the role of the smoothing parameter, and increasing \(\lambda \) leads to a more strongly penalized, i.e. smoother, fit. Comprehending the latter component in (A1) as a log prior leads to a Bayesian framework so that
\[ u \sim N(0, \lambda ^{-1} D^-), \]
where \( D^-\) stands for the (generalized) inverse of D. Now \(\lambda \) plays the role of a hyperparameter which can be estimated using empirical Bayes ideas. We refer to Wand (2003) for details in this direction.
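To make the construction concrete, the following Python sketch fits a penalized B-spline in the simplest special case of a Gaussian response with constant variance, where maximizing the penalized likelihood reduces to penalized least squares. It is a minimal illustration under stated assumptions, not the implementation used in the paper: the function names `bspline_basis` and `pspline_fit`, the equally spaced knots, the number of segments, the second-order difference penalty and the toy data are all choices made here, and the error variance is absorbed into \(\lambda \).

```python
import numpy as np
from scipy.interpolate import BSpline


def bspline_basis(x, xl, xr, nseg=20, degree=3):
    """B-spline design matrix B(x) on equally spaced knots covering [xl, xr]."""
    dx = (xr - xl) / nseg
    # Extend the knot sequence beyond [xl, xr] so that every x in the
    # interval is covered by degree + 1 basis functions.
    knots = xl + dx * np.arange(-degree, nseg + degree + 1)
    n_basis = nseg + degree
    # A spline with identity coefficients evaluates all basis functions
    # at once; the result has shape (len(x), n_basis).
    return BSpline(knots, np.eye(n_basis), degree)(x)


def pspline_fit(x, y, lam, nseg=20, degree=3, diff_order=2):
    """Penalized least squares: minimize ||y - B u||^2 + lam * u' D u."""
    B = bspline_basis(x, x.min(), x.max(), nseg, degree)
    # Difference penalty on neighboring coefficients: D = Delta' Delta.
    delta = np.diff(np.eye(B.shape[1]), n=diff_order, axis=0)
    D = delta.T @ delta
    u = np.linalg.solve(B.T @ B + lam * D, B.T @ y)
    return u, B, D


# Toy data (hypothetical): a noisy sine curve.
rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0.0, 1.0, 200))
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, 200)

for lam in (0.01, 1.0, 100.0):
    u, B, D = pspline_fit(x, y, lam)
    # The roughness u' D u shrinks as the smoothing parameter grows.
    print(lam, float(u @ D @ u))
```

The smoothing parameter is fixed by hand in this sketch; the empirical Bayes route mentioned above would instead treat \(\lambda \) as a variance ratio in a mixed model and estimate it from the data.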
Appendix B: Multivariate metrical variables
We repeat the simulation for bivariate x and simulate data from the model
where \(\varepsilon \sim N(0, \sigma )\) and \(z=(z_1,z_2,z_3)\) is a vector of binary covariates which are correlated with \(x=(x_1, x_2)\). For the functional forms \(m(x_1)\) and \(v(x_2)\) we use the same response functions as shown in Fig. 2 for different values of \(z_1, z_2\) and \(z_3\) for univariate x. The population size, the first- and second-phase sample sizes and the true \(\beta _z\) values for the covariates z are the same as for model (14). We consider two cases. In the first case, the response variable Y and the covariates \(x_1\), \(x_2\) are observed in the first-phase sample \(s_1\), while the covariates z are observed in the second-phase sample \(s_2\) only. In the second case, Y and the covariate \(x_1\) are observed in the first phase, while \(x_2\) and z are observed in the second-phase sample only. We use \(\varepsilon \sim N(0, 1)\) and \(\varepsilon \sim N(0, 1.5)\) in model (B1) for cases 1 and 2, respectively. The covariates \(x_1\) and \(x_2\) are generated independently from uniform distributions, with \(x_1 \sim U(20, 160)\) and \(x_2 \sim U(25, 100)\) for case 1 and \(x_1 \sim U(20, 160)\) and \(x_2 \sim U(5, 20)\) for case 2. The covariates z are generated from Bernoulli distributions whose success probabilities depend on both \(x_1\) and \(x_2\) through response functions similar to those in the univariate case. The ratios of the prediction errors are shown in Figs. 11 and 12 for case 1 and in Figs. 13 and 14 for case 2, and the median values of the mean squared prediction error for both cases are given in Table 5. The overall interpretation remains unchanged. The bias and the estimated variance of the regression coefficient \(\hat{\beta _2}\) are given in Table 6. The results are similar to those for univariate x as discussed in Sect. 3.1.
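The data structure of case 1 can be mimicked in a few lines of Python. The sketch below is not the simulation code of the paper. Only the uniform ranges of \(x_1\) and \(x_2\) and the unit error scale are taken from the text; the population and sample sizes, the mean function, the coefficients `beta_z`, the logistic response for z and the size measure driving the second-phase inclusion probabilities are hypothetical placeholders. Poisson sampling is swapped in for the second phase for simplicity, so the realized second-phase sample size is random.

```python
import numpy as np


def simulate_two_phase(n_pop=10_000, n1=2_000, n2=300, seed=0):
    """Two-phase sample in which z is missing by design outside phase two."""
    rng = np.random.default_rng(seed)

    # Population: metrical covariates with the case 1 ranges.
    x1 = rng.uniform(20, 160, n_pop)
    x2 = rng.uniform(25, 100, n_pop)
    # Binary covariates correlated with x (assumed logistic response).
    p_z = 1.0 / (1.0 + np.exp(-((x1 - 90) / 40 + (x2 - 62.5) / 40)))
    z = (rng.uniform(size=(n_pop, 3)) < p_z[:, None]).astype(float)
    # Placeholder mean structure and coefficients (hypothetical).
    beta_z = np.array([1.0, -1.0, 0.5])
    y = np.sin(x1 / 25) + 0.01 * x2 + z @ beta_z + rng.normal(0.0, 1.0, n_pop)

    # Phase 1: simple random sample without replacement; Y and x recorded.
    s1 = rng.choice(n_pop, size=n1, replace=False)
    y1, x1_s, x2_s, z_s = y[s1], x1[s1], x2[s1], z[s1]

    # Phase 2: inclusion probabilities depend on the observed Y and x.
    size = 1.0 + np.abs(y1 - y1.mean()) + x1_s / x1_s.max()
    pi = np.minimum(1.0, n2 * size / size.sum())
    in_s2 = rng.uniform(size=n1) < pi  # Poisson sampling (assumption)

    # z is recorded for phase-two units only, missing (NaN) otherwise.
    z_obs = np.where(in_s2[:, None], z_s, np.nan)
    return {"y": y1, "x1": x1_s, "x2": x2_s, "z": z_obs,
            "pi": pi, "in_s2": in_s2}


data = simulate_two_phase()
print("phase 1 size:", data["y"].size)
print("phase 2 size:", int(data["in_s2"].sum()))
print("share of units with z missing:", float(np.isnan(data["z"][:, 0]).mean()))
```

A complete-case analysis would use only the rows with `in_s2` equal to true, typically weighted by `1 / pi`, whereas the approach of the paper also exploits the first-phase rows in which z is missing.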
Cite this article
Kauermann, G., Ali, M. Semi-parametric regression when some (expensive) covariates are missing by design. Stat Papers 62, 1675–1696 (2021). https://doi.org/10.1007/s00362-019-01152-5
DOI: https://doi.org/10.1007/s00362-019-01152-5