Semi-parametric regression when some (expensive) covariates are missing by design

  • Regular Article
  • Statistical Papers

Abstract

The paper deals with the scenario where some covariates are observed, by design, for only a subset of the observations. In the example treated in the paper, this occurs in a two-phase sampling scheme: in the first phase, a relatively large sample is drawn to record a response variable Y and a set of (cheap) covariates x. In the second phase, a smaller sample is drawn from the first-phase sample, for which additional (usually expensive) covariates z are also recorded. The second phase can be drawn with unequal probability sampling, where the sampling weights depend on the observed Y and x. The overall intention is to fit a regression model of Y on both x and z. Due to the design of the data collection, z is missing for the majority of observations. We propose an approximate estimation approach that uses semi-parametric mean and variance regression of Y on x only and augments this fit with a full regression model of Y on x and z. The idea extends the approach of Little (1992) to non-normal data and non-linear models. The proposed estimation is numerically rather simple and performs convincingly well in simulation studies compared to alternatives such as complete-case and multiple imputation analysis.

References

  • Anderson TW (1957) Maximum likelihood estimates for a multivariate normal distribution when some observations are missing. J Am Stat Assoc 52(278):200–203

  • de Boor C (1972) On calculating with B-splines. J Approx Theory 6(1):50–62

  • Carpenter JR, Kenward M (2013) Multiple imputation and its applications, 1st edn. Wiley, Chichester

  • Deville JC, Tille Y (1998) Unequal probability sampling without replacement through a splitting method. Biometrika 85(1):89–101

  • Donders ART, van der Heijden GJMG, Stijnen T, Moons KGM (2006) Review: a gentle introduction to imputation of missing values. J Clin Epidemiol 59(10):1087–1091

  • Eilers PHC, Marx BD (1996) Flexible smoothing with B-splines and penalties. Stat Sci 11(2):89–121

  • Fahrmeir L, Gieger C, Klinger A (1998) Econometrics in theory and practice. Physica-Verlag, Heidelberg

  • Fitzenberger B, Fuchs B (2017) The residency discount for rents in Germany and the tenancy law reform act 2001: evidence from quantile regressions. German Econ Rev 18(2):212–236

  • Hanif M, Brewer KRW (1980) Sampling with unequal probabilities without replacement: a review. Int Stat Rev 48(3):317–335

  • Hayati RP, Lee KJ, Simpson JA (2015) The rise of multiple imputation: a review of the reporting and implementation of the method in medical research. BMC Med Res Methodol 15:30

  • Horton NJ, Kleinman KP (2007) Much ado about nothing: a comparison of missing data methods and software to fit incomplete data regression models. Am Stat 61(1):79–90

  • Ibrahim JG, Chen MH, Lipsitz SR, Herring AH (2005) Missing data methods for generalized linear models: a comparative review. J Am Stat Assoc 100(469):332–346

  • Lawless JF, Kalbfleisch JD, Wild CJ (1999) Semiparametric methods for response selective and missing data problems in regression. J R Stat Soc Ser B 61(2):413–438

  • Liang H (2008) Generalized partially linear models with missing covariates. J Multivar Anal 99(5):880–895

  • Liang H, Wang S, Robins JM, Carroll RJ (2004) Estimation in partially linear models with missing covariates. J Am Stat Assoc 99(466):357–367

  • Little RJA (1992) Regression with missing X’s: a review. J Am Stat Assoc 87(420):1227–1237

  • Little R, An H (2004) Robust likelihood-based analysis of multivariate data with missing values. Stat Sin 14(3):949–968

  • Little RJA, Rubin DB (2002) Statistical analysis with missing data, 2nd edn. Wiley, New York

  • Lumley T (2017) Robustness of semiparametric efficiency in nearly-true models for two-phase samples. arXiv:1707.05924

  • Mandallaz D, Breschan J, Hill A (2013) New regression estimators in forest inventory with two phase sampling and partially exhaustive information: a design based Monte Carlo approach with applications to small area estimation. Can J For Res 43(11):1023–1031

  • McLeish DL, Struthers CA (2006) Estimation of regression parameters in missing data problems. Can J Stat 34(2):233–259

  • Meng XL (2000) Missing data: dial m for ??? J Am Stat Assoc 95(452):1325–1330

  • Mitra R, Reiter JP (2016) A comparison of two methods of estimating propensity scores after multiple imputation. Stat Methods Med Res 25(1):188–204

  • O’Sullivan F (1986) A statistical perspective on ill-posed inverse problems. Stat Sci 1(4):502–518

  • Qin G, Zhu Z, Fung WK (2012) Robust estimation of the generalised partial linear model with missing covariates. J Nonparametric Stat 24(2):517–530

  • Robins JM, Rotnitzky A, Zhao LP (1994) Estimation of regression coefficients when some regressors are not always observed. J Am Stat Assoc 89(427):846–866

  • Robins JM, Rotnitzky A, Zhao LP (1995) Analysis of semiparametric regression models for repeated outcomes in the presence of missing data. J Am Stat Assoc 90(429):106–121

  • Ruppert D, Wand MP, Carroll RJ (2003) Semiparametric regression. Cambridge University Press, Cambridge

  • Ruppert D, Wand MP, Carroll RJ (2009) Semiparametric regression during 2003–2007. Electron J Stat 3:1193–1256

  • Saegusa T (2014) Bootstrapping two-phase sampling. e-print https://arxiv.org/abs/1406.5580v1

  • Saegusa T (2015) Variance estimation under two phase sampling. Scand J Stat 42(4):1078–1091

  • Stasinopoulos DM, Rigby RA, Heller GZ, Voudouris V, De Bastiani F (2017) Flexible regression and smoothing: using GAMLSS in R. Chapman and Hall/CRC, Boca Raton

  • Thompson SK (2012) Sampling, 3rd edn. Wiley, New York

  • Tille Y (1996) An elimination procedure of unequal probability sampling without replacement. Biometrika 83(1):238–241

  • Tille Y (2006) Sampling algorithms. Springer, New York

  • Tille Y, Matei A (2016) The R package sampling. The comprehensive R archive network. http://cran.r-project.org/

  • Toutenburg H, Nittner T (2002) Linear regression models with incomplete categorical covariates. Comput Stat 17:215–232

  • van Buuren S, Groothuis-Oudshoorn K (2011) mice: multivariate imputation by chained equations in R. J Stat Softw 45(3):1–67

  • Wand MP (2003) Smoothing and mixed models. Comput Stat 18(2):223–249

  • Wang QH (2009) Statistical estimation in partial linear models with covariate data missing at random. Ann Inst Stat Math 61(1):47–84

  • Wood SN (2017) Generalized additive models—an introduction with R, 2nd edn. CRC Press, Boca Raton

  • Yang S, Kim JK (2016) Fractional imputation in survey sampling: a comparative review. Stat Sci 31(3):415–432

  • Zhang G, Little R (2009) Extensions of the penalized spline of propensity prediction method of imputation. Biometrics 65(3):911–918

  • Zhang Z, Rockette HE (2005) On maximum likelihood estimation in parametric regression with missing covariates. J Stat Plan Inference 134(1):206–223

  • Zhang N, Chen H, Elliott M (2016) Nonrespondent subsample multiple imputation in two-phase sampling for nonresponse. J Off Stat 32(3):769–785

  • Zhao Y, Lawless JF, McLeish DL (2009) Likelihood methods for regression models with expensive variables missing by design. Biom J 51(1):123–136


Acknowledgements

Mehboob Ali acknowledges financial support provided by Punjab Higher Education Commission for finishing his dissertation at LMU Munich.

Author information

Corresponding author

Correspondence to Göran Kauermann.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A: Penalized spline smoothing

Penalized spline smoothing is a very general and numerically stable routine for fitting smooth functions. We refer to Ruppert et al. (2003, 2009) for an extensive discussion of the field; here we sketch the basic ideas. The main principle is to replace the smooth function m(x) in model (7) and the smooth functions \(m_1(x)\) and \(\sigma ^2_1(x)\) in model (8) by spline basis representations. That is, we set \(m(x) = B(x)u\), \(m_1(x) = B(x)u_1\) and \(\sigma ^2_1(x) = \exp (B_\sigma (x) u_\sigma )\), where B(x) and \(B_\sigma (x)\) are spline bases; in principle we can set \(B(x) = B_\sigma (x)\). A convenient choice is a B-spline basis (see de Boor 1972), which is constructed from piecewise polynomial functions tied together in a continuous (and, where necessary, differentiable) way. This makes the whole model parametric, with the spline coefficients u in model (7) and the coefficients \(u_1\) and \(u_\sigma \) in model (8) as the parameters to be estimated. Because B(x) is chosen as a high-dimensional basis, the coefficient vectors are high-dimensional as well, and unpenalized estimation would induce large estimation variability. Eilers and Marx (1996) therefore proposed imposing a penalty on u, for example by requiring that neighboring coefficients do not differ very much. Such a penalty can be written as a quadratic form \(\lambda u^t D u\) for an appropriately chosen penalty matrix D. This leads to the penalized likelihood

$$\begin{aligned} l(\theta ) - \frac{1}{2}\lambda u^t D u \end{aligned}$$
(A1)

where \(\theta \) is the parameter vector of the model, which also contains the coefficient vector u. The parameter \(\lambda \) plays the role of the smoothing parameter: increasing \(\lambda \) leads to a more heavily penalized, and hence smoother, fit. Interpreting the penalty term in (A1) as a log prior leads to a Bayesian framework in which

$$\begin{aligned}&u \sim N(0, \lambda ^{-1} D^-) \\&y|u \sim \exp (l(\theta )) \end{aligned}$$

where \( D^-\) stands for the (generalized) inverse of D. Now \(\lambda \) plays the role of a hyperparameter, which can be estimated using empirical Bayes ideas. We refer to Wand (2003) for details in this direction.
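
As an illustration of the penalized fit in (A1), the following is a minimal sketch for the Gaussian case with unit variance, where maximizing the penalized likelihood reduces to solving \((B^t B + \lambda D)u = B^t y\). It is not the authors' implementation: the function names, the equidistant knot layout, the second-order difference penalty and the toy data are all assumptions made for illustration, and \(\lambda \) is fixed by hand rather than estimated by empirical Bayes.

```python
import numpy as np
from scipy.interpolate import BSpline


def bspline_basis(x, x_min, x_max, n_segments=20, degree=3):
    """Evaluate an equidistant B-spline basis B(x); returns an (n, K) matrix."""
    h = (x_max - x_min) / n_segments
    # Extend the knots by `degree` intervals on both sides so that the basis
    # functions sum to one over the whole range [x_min, x_max].
    knots = x_min + h * np.arange(-degree, n_segments + degree + 1)
    n_basis = len(knots) - degree - 1
    # With identity coefficients the spline returns one column per basis function.
    return BSpline(knots, np.eye(n_basis), degree)(x)


def difference_penalty(n_basis, order=2):
    """Penalty matrix D = Delta' Delta built from differences of given order."""
    delta = np.diff(np.eye(n_basis), n=order, axis=0)
    return delta.T @ delta


def fit_pspline(x, y, lam, n_segments=20, degree=3):
    """Maximize -0.5*||y - B u||^2 - 0.5*lam*u'Du (Gaussian case of (A1))."""
    B = bspline_basis(x, x.min(), x.max(), n_segments, degree)
    D = difference_penalty(B.shape[1])
    u = np.linalg.solve(B.T @ B + lam * D, B.T @ y)
    return u, B, D


# Toy data (assumed, not from the paper): a smooth signal plus Gaussian noise.
rng = np.random.default_rng(1)
x = np.sort(rng.uniform(20, 160, size=300))
y = np.sin(x / 25.0) + rng.normal(scale=0.3, size=x.size)

u_rough, B, D = fit_pspline(x, y, lam=0.01)
u_smooth, _, _ = fit_pspline(x, y, lam=100.0)
print("roughness u'Du:", u_rough @ D @ u_rough, "->", u_smooth @ D @ u_smooth)
```

Increasing `lam` shrinks the roughness term \(u^t D u\), which corresponds to the more penalized fit described above. With a second-order difference penalty, a linear trend carries no penalty and is reproduced exactly.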

Appendix B: Multivariate metrical variables

We repeat the simulation for bivariate x and simulate data from the model

$$\begin{aligned} Y = \beta _0 + m(x_1) + v(x_2) + z\beta _z + \varepsilon \end{aligned}$$
(B1)

where \(\varepsilon \sim N(0, \sigma )\) and \(z=(z_1,z_2,z_3)\) is a vector of binary covariates which are correlated with \(x=(x_1, x_2)\). For the functional forms \(m(x_1)\) and \(v(x_2)\) we use the same response functions as shown in Fig. 2 for different values of \(z_1, z_2\) and \(z_3\) for univariate x. The population size, the first- and second-phase sample sizes and the true \(\beta _z\) values for the covariates z are the same as for model (14). We consider two cases. In the first case, the response variable Y and the covariates \(x_1\) and \(x_2\) are observed in the first-phase sample \(s_1\), while the covariates z are observed in the second-phase sample \(s_2\) only. In the second case, Y and \(x_1\) are observed in the first phase, while \(x_2\) and z are observed in the second-phase sample only. We use \(\varepsilon \sim N(0, 1)\) for case 1 and \(\varepsilon \sim N(0, 1.5)\) for case 2 in model (B1). The covariates \(x_1\) and \(x_2\) are generated independently from uniform distributions, with \(x_1 \sim U(20, 160)\) and \(x_2 \sim U(25, 100)\) for case 1 and \(x_1 \sim U(20, 160)\) and \(x_2 \sim U(5, 20)\) for case 2. The covariates z are generated from a Bernoulli distribution, using both \(x_1\) and \(x_2\) in response functions similar to those in the univariate case. The ratios of the prediction errors are shown in Figs. 11 and 12 for case 1 and in Figs. 13 and 14 for case 2; the median values of the mean squared prediction error for both cases are given in Table 5. The overall interpretation remains unchanged. The bias and the estimated variance of the regression coefficient \({\hat{\beta }}_2\) are given in Table 6. The results are similar to those for univariate x discussed in Sect. 3.1.
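
The data-generating steps for case 1 can be sketched roughly as follows. This is only an illustration of the structure of model (B1) and of the two-phase design. The sizes `N`, `n1`, the coefficients `beta0` and `beta_z`, the functions `m` and `v` and the logistic response for z are placeholders assumed here, since the actual values are specified with model (14) and Fig. 2 and are not repeated in this appendix. The second phase is drawn with equal probabilities only, and the Tille unequal probability designs are not reproduced.

```python
import numpy as np

rng = np.random.default_rng(2021)

# Hypothetical sizes and coefficients; the paper's values come with model (14).
N, n1, n2 = 10_000, 1_000, 200
beta0 = 1.0
beta_z = np.array([0.5, -0.5, 1.0])


def m(x1):
    """Placeholder smooth effect of x1 (assumed, not the paper's function)."""
    return np.sin(x1 / 30.0)


def v(x2):
    """Placeholder smooth effect of x2 (assumed, not the paper's function)."""
    return np.log(x2)


# Case 1 covariates: x1 ~ U(20, 160) and x2 ~ U(25, 100), drawn independently.
x1 = rng.uniform(20, 160, size=N)
x2 = rng.uniform(25, 100, size=N)

# Binary z correlated with x through an assumed logistic response function.
eta = -1.0 + 0.01 * x1 + 0.01 * x2
p = 1.0 / (1.0 + np.exp(-eta))
z = rng.binomial(1, p[:, None], size=(N, 3)).astype(float)

# Model (B1) with standard normal errors, as used for case 1.
eps = rng.normal(0.0, 1.0, size=N)
y = beta0 + m(x1) + v(x2) + z @ beta_z + eps

# Phase 1: simple random sample s1, where Y, x1 and x2 are recorded.
s1 = rng.choice(N, size=n1, replace=False)
# Phase 2: equal-probability subsample s2 of s1, where z is also recorded.
s2 = rng.choice(s1, size=n2, replace=False)

# z is missing by design for every unit outside s2.
z_obs = np.full((N, 3), np.nan)
z_obs[s2] = z[s2]
phase1 = {"y": y[s1], "x1": x1[s1], "x2": x2[s1], "z": z_obs[s1]}
print("share of phase-1 units without z:", np.isnan(phase1["z"]).any(axis=1).mean())
```

For case 2, the same sketch would additionally mask `x2` outside `s2` and use the case 2 settings for \(x_2\) and \(\varepsilon \).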

Fig. 11 \(n_2=200\): ratio of mean squared prediction error for simulated data. Second phase sample selection with equal, Tille covariate and Tille residual dependent probability sampling with case 1

Fig. 12 \(n_2=400\): ratio of mean squared prediction error for simulated data. Second phase sample selection with equal, Tille covariate and Tille residual dependent probability sampling with case 1

Fig. 13 \(n_2=200\): ratio of mean squared prediction error for simulated data. Second phase sample selection with equal, Tille covariate and Tille residual dependent probability sampling with case 2

Fig. 14 \(n_2=400\): ratio of mean squared prediction error for simulated data. Second phase sample selection with equal, Tille covariate and Tille residual dependent probability sampling with case 2

Table 5 Median of mean squared prediction error for simulated data (for bivariate x)
Table 6 Bias and estimated variance for regression coefficient \({\hat{\beta }}_2\) for simulated data (for bivariate x)

Cite this article

Kauermann, G., Ali, M. Semi-parametric regression when some (expensive) covariates are missing by design. Stat Papers 62, 1675–1696 (2021). https://doi.org/10.1007/s00362-019-01152-5

  • DOI: https://doi.org/10.1007/s00362-019-01152-5
