Semi-parametric regression when some (expensive) covariates are missing by design

Kauermann, Göran; Ali, Mehboob

doi:10.1007/s00362-019-01152-5

Regular Article
Published: 01 January 2020

Volume 62, pages 1675–1696, (2021)
Statistical Papers

Abstract

The paper deals with the scenario where some covariates are observed, by design, for a subset of the observations only. In the example treated in the paper this occurs with a two-phase sampling scheme: in the first phase a relatively large sample is drawn to record a response variable Y and a set of (cheap) covariates x. In the second phase a smaller sample is drawn from the first-phase sample, for which additional (usually expensive) covariates z are also recorded. The second phase can be drawn with unequal probability sampling, where the sampling weights depend on the observed Y and x. The overall intention is to fit a regression model of Y on both x and z. Due to the design of the data collection, z is missing for the majority of observations. We propose an approximate estimation approach that uses semi-parametric mean and variance regression of Y on x only and augments this fit with a full regression model of Y on x and z. The idea extends the approach of Little (1992) towards non-normal data and non-linear models. The proposed estimation is numerically rather simple and performs convincingly well in simulation studies compared to alternatives such as complete-case and multiple imputation analysis.

References

Anderson TW (1957) Maximum likelihood estimates for a multivariate normal distribution when some observations are missing. J Am Stat Assoc 52(278):200–203
Boor CD (1972) On calculating with B-splines. J Approx Theory 6(1):50–62
Carpenter JR, Kenward M (2013) Multiple imputation and its applications, 1st edn. Wiley, Chichester
Deville JC, Tille Y (1998) Unequal probability sampling without replacement through a splitting method. Biometrika 85(1):89–101
Donders ART, van der Heijden GJMG, Stijnen T, Moons KGM (2006) Review: a gentle introduction to imputation of missing values. J Clin Epidemiol 59(10):1087–1091
Eilers PHC, Marx BD (1996) Flexible smoothing with B-splines and penalties. Stat Sci 11(2):89–121
Fahrmeir L, Gieger C, Klinger A (1998) Econometrics in theory and practice. Physica-Verlag, Heidelberg
Fitzenberger B, Fuchs B (2017) The residency discount for rents in Germany and the tenancy law reform act 2001: evidence from quantile regressions. German Econ Rev 18(2):212–236
Hanif M, Brewer KRW (1980) Sampling with unequal probabilities without replacement: a review. Int Stat Rev 48(3):317–335
Hayati RP, Lee KJ, Simpson JA (2015) The rise of multiple imputation: a review of the reporting and implementation of the method in medical research. Med Res Methodol 15:30
Horton NJ, Kleinman KP (2007) Much ado about nothing: a comparison of missing data methods and software to fit incomplete data regression models. Am Stat 61(1):79–90
Ibrahim JG, Chen MH, Lipsitz SR, Herring AH (2005) Missing data methods for generalized linear models: a comparative review. J Am Stat Assoc 100(469):332–346
Lawless JF, Kalbfleisch JD, Wild CJ (1999) Semiparametric methods for response selective and missing data problems in regression. J R Stat Soc 61(2):413–438
Liang H (2008) Generalized partially linear models with missing covariates. J Multivar Anal 99(5):880–895
Liang H, Wang S, Robins JM, Carroll RJ (2004) Estimation in partially linear models with missing covariates. J Am Stat Assoc 99(466):357–367
Little RJA (1992) Regression with missing X’s: a review. J Am Stat Assoc 87(420):1227–1237
Little R, An H (2004) Robust likelihood-based analysis of multivariate data with missing values. Stat Sin 14(3):949–968
Little RJA, Rubin DB (2002) Statistical analysis with missing data, 2nd edn. Wiley, New York
Lumley T (2017) Robustness of semiparametric efficiency in nearly-true models for two-phase samples. arXiv:1707.05924
Mandallaz D, Breschan J, Hill A (2013) New regression estimators in forest inventory with two phase sampling and partially exhaustive information: a design based monte carlo approach with applications to small area estimation. Can J For Res 43(11):1023–1031
Mcleish DL, Struthers CA (2006) Estimation of regression parameters in missing data problems. Can J Stat 34(2):233–259
Meng XL (2000) Missing data: dial m for ??? J Am Stat Assoc 95(452):1325–1330
Mitra R, Reiter JP (2016) A comparison of two methods of estimating propensity scores after multiple imputation. Stat Methods Med Res 25(1):188–204
O’Sullivan F (1986) A statistical perspective on ill-posed inverse problems. Stat Sci 1(4):502–518
Qin G, Zhu Z, Fung WK (2012) Robust estimation of the generalised partial linear model with missing covariates. J Nonparametric Stat 24(2):517–530
Robins JM, Rotnitzky A, Zhao LP (1994) Estimation of regression coefficients when some regressors are not always observed. J Am Stat Assoc 89(427):846–866
Robins JM, Rotnitzky A, Zhao LP (1995) Analysis of semiparametric regression models for repeated outcomes in the presence of missing data. J Am Stat Assoc 90(429):106–121
Ruppert D, Wand MP, Carroll RJ (2003) Semiparametric regression. Cambridge University Press, Cambridge
Ruppert D, Wand MP, Carroll RJ (2009) Semiparametric regression during 2003–2007. Electron J Stat 3:1193–1256
Saegusa T (2014) Bootstrapping two-phase sampling. e-print https://arxiv.org/abs/1406.5580v1
Saegusa T (2015) Variance estimation under two phase sampling. Scand J Stat 42(4):1078–1091
Stasinopoulos DM, Rigby RA, Heller GZ, Voudouris V, De Bastiani F (2017) Flexible regression and smoothing: using GAMLSS in R. Chapman and Hall/CRC, Boca Raton
Thompson SK (2012) Sampling, 3rd edn. Wiley, New York
Tille Y (1996) An elimination procedure of unequal probability sampling without replacement. Biometrika 83(1):238–241
Tille Y (2006) Sampling algorithms. Springer, New York
Tille Y, Matei A (2016) The R package sampling. The comprehensive R archive network. http://cran.r-project.org/
Toutenburg H, Nittner T (2002) Linear regression models with incomplete categorical covariates. Comput Stat 17:215–232
van Buuren S, Groothuis-Oudshoorn K (2011) mice: multivariate imputation by chained equations in R. J Stat Softw 45(3):1–67
Wand MP (2003) Smoothing and mixed models. Comput Stat 18(2):223–249
Wang QH (2009) Statistical estimation in partial linear models with covariate data missing at random. Ann Inst Stat Math 61(1):47–84
Wood SN (2017) Generalized additive models—an introduction with R, 2nd edn. CRC Press, Boca Raton
Yang S, Kim JK (2016) Fractional imputation in survey sampling: a comparative review. Stat Sci 31(3):415–432
Zhang G, Little R (2009) Extensions of the penalized spline of propensity prediction method of imputation. Biometrics 65(3):911–918
Zhang Z, Rockette HE (2005) On maximum likelihood estimation in parametric regression with missing covariates. J Stat Plan Inference 134(1):206–223
Zhang N, Chen H, Elliott M (2016) Nonrespondent subsample multiple imputation in two-phase sampling for nonresponse. J Off Stat 32(3):769–785
Zhao Y, Lawless JF, Mcleish DL (2009) Likelihood methods for regression models with expensive variables missing by design. Biom J 51(1):123–136

Acknowledgements

Mehboob Ali acknowledges financial support provided by Punjab Higher Education Commission for finishing his dissertation at LMU Munich.

Author information

Authors and Affiliations

Department of Statistics, Ludwig-Maximilians-Universität München, Ludwigstrasse 33, D-80359, Munich, Germany
Göran Kauermann & Mehboob Ali

Corresponding author

Correspondence to Göran Kauermann.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A: Penalized spline smoothing

Penalized spline smoothing is a very general and numerically stable routine for fitting smooth functions. We refer to Ruppert et al. (2003, 2009) for an extensive discussion of the field. In the following we sketch the basic ideas. The main principle is to replace the smooth function $m(x)$ in model (7) and the smooth functions $m_1(x)$ and $\sigma ^2_1(x)$ in model (8) by spline basis representations. That is, we set $m(x) = B(x)u$, $m_1(x) = B(x)u_1$ and $\sigma ^2_1 (x) = \exp (B_\sigma (x) u_\sigma )$, where $B(x)$ and $B_\sigma (x)$ are spline bases; in principle we can set $B(x) = B_\sigma (x)$. A convenient choice is a B-spline basis (see Boor 1972), which is constructed from piecewise polynomial functions tied together in a continuous (and, where necessary, differentiable) way. This makes the whole model parametric, where the spline coefficients $u$ in model (7) and the coefficients $u_1$ and $u_\sigma $ in model (8) are the parameters to be estimated. Because $B(x)$ is chosen as a high-dimensional basis, the coefficient vectors are high dimensional as well. Unpenalized estimation would therefore induce large estimation variability, which is why Eilers and Marx (1996) proposed imposing a penalty on $u$, e.g. requiring that neighboring coefficients do not differ very much. Such a penalty can be written as the quadratic form $\lambda u^t D u$ for an appropriately chosen penalty matrix $D$. This leads to the penalized likelihood

$$\begin{aligned} l(\theta ) - \frac{1}{2}\lambda u^t D u \end{aligned}$$

(A1)

where $\theta $ is the parameter vector of the model, which also contains the coefficient vector $u$. The parameter $\lambda $ plays the role of the smoothing parameter, and increasing $\lambda $ leads to a more heavily penalized fit. Interpreting the penalty term in (A1) as a log prior leads to a Bayesian framework in which

$$\begin{aligned}&u \sim N(0, \lambda ^{-1} D^-) \\&y|u \sim \exp (l(\theta )) \end{aligned}$$

where $D^-$ stands for the (generalized) inverse of $D$. Now $\lambda $ plays the role of a hyperparameter, which can be estimated using empirical Bayes ideas. We refer to Wand (2003) for details in this direction.
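To make the construction concrete, the following is a minimal sketch of a penalized spline fit for the Gaussian case; it is not the implementation used in the paper. It assumes unit error variance, so that maximizing (A1) amounts to minimizing $\frac{1}{2}\Vert y - Bu\Vert ^2 + \frac{1}{2}\lambda u^t D u$ with solution $\hat{u} = (B^t B + \lambda D)^{-1} B^t y$, and it takes $D$ to be a second-order difference penalty on equidistant knots in the spirit of Eilers and Marx (1996). The function names, the number of knot segments and the value of $\lambda $ are illustrative assumptions, and the code is deliberately kept free of external dependencies.

```python
def bspline_basis(x, knots, degree):
    """Evaluate all B-spline basis functions at x (Cox-de Boor recursion)."""
    b = [1.0 if knots[i] <= x < knots[i + 1] else 0.0
         for i in range(len(knots) - 1)]
    for p in range(1, degree + 1):
        b = [(x - knots[i]) / (knots[i + p] - knots[i]) * b[i]
             + (knots[i + p + 1] - x) / (knots[i + p + 1] - knots[i + 1]) * b[i + 1]
             for i in range(len(knots) - p - 1)]
    return b


def difference_penalty(k, order=2):
    """Return D = Delta^T Delta, where Delta takes order-th differences."""
    delta = [[1.0 if i == j else 0.0 for j in range(k)] for i in range(k)]
    for _ in range(order):
        delta = [[r2[j] - r1[j] for j in range(k)]
                 for r1, r2 in zip(delta, delta[1:])]
    return [[sum(row[i] * row[j] for row in delta) for j in range(k)]
            for i in range(k)]


def solve(a, b):
    """Solve a u = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for j in range(c, n + 1):
                m[r][j] -= f * m[c][j]
    u = [0.0] * n
    for r in range(n - 1, -1, -1):
        u[r] = (m[r][n] - sum(m[r][j] * u[j] for j in range(r + 1, n))) / m[r][r]
    return u


def pspline_fit(x, y, lam, nseg=10, degree=3):
    """Penalized least squares fit; returns coefficients u and fitted values."""
    lo, hi = min(x), max(x)
    h = (hi - lo) / nseg
    # Equidistant knots, extended by `degree` intervals on each side (assumption).
    knots = [lo + (j - degree) * h for j in range(nseg + 2 * degree + 1)]
    basis = [bspline_basis(xi, knots, degree) for xi in x]
    k = nseg + degree
    pen = difference_penalty(k)
    # Normal equations (B^T B + lam * D) u = B^T y.
    lhs = [[sum(row[i] * row[j] for row in basis) + lam * pen[i][j]
            for j in range(k)] for i in range(k)]
    rhs = [sum(row[i] * yi for row, yi in zip(basis, y)) for i in range(k)]
    u = solve(lhs, rhs)
    fitted = [sum(bi * ui for bi, ui in zip(row, u)) for row in basis]
    return u, fitted
```

Because second differences of a linear coefficient sequence vanish, data lying exactly on a straight line are reproduced up to rounding error for any $\lambda $; for curved data, a larger $\lambda $ pulls the fit towards a straight line.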

Appendix B: Multivariate metrical variables

We repeat the simulation for bivariate x and simulate data from the model

$$\begin{aligned} Y = \beta _0 + m(x_1) + v(x_2) + z\beta _z + \varepsilon \end{aligned}$$

(B1)

where $\varepsilon \sim N(0, \sigma )$ and $z=(z_1,z_2,z_3)$ is a vector of binary covariates that are correlated with $x=(x_1, x_2)$. For the functional forms $m(x_1)$ and $v(x_2)$ we use the same response functions as shown in Fig. 2 for different values of $z_1$, $z_2$ and $z_3$ for univariate x. The population size, the first- and second-phase sample sizes and the true $\beta _z$ values for the covariates z are the same as for model (14). We consider two cases. In the first case, the response variable Y and the covariates $x_1$ and $x_2$ are observed in the first-phase sample $s_1$, while the covariates z are observed in the second-phase sample $s_2$ only. In the second case, we observe the response variable Y and the covariate $x_1$ in the first phase, while $x_2$ and z are observed in the second-phase sample only. We use $\varepsilon \sim N(0, 1)$ for case 1 and $\varepsilon \sim N(0, 1.5)$ for case 2 in model (B1). The covariates $x_1$ and $x_2$ are generated independently from uniform distributions, with $x_1 \sim U(20, 160)$ and $x_2 \sim U(25, 100)$ for case 1 and $x_1 \sim U(20, 160)$ and $x_2 \sim U(5, 20)$ for case 2. The covariates z are generated from a Bernoulli distribution whose success probability depends on both $x_1$ and $x_2$ through response functions similar to those in the univariate case.

The ratios of the prediction errors are shown in Figs. 11 and 12 for case 1 and in Figs. 13 and 14 for case 2, and the median values of the mean squared prediction error for both cases are given in Table 5. The overall interpretation remains unchanged. The bias and the estimated variance of the regression coefficient ${\hat{\beta }}_2$ are given in Table 6. The results are similar to those for univariate x as discussed in Sect. 3.1.
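The data-generating scheme for case 1 can be sketched as follows. This is only an illustration of the structure of model (B1): the smooth effects `m` and `v`, the logistic link between z and $(x_1, x_2)$, the coefficient values and the sample sizes are all hypothetical placeholders, since the paper's actual choices (Fig. 2, model (14)) are not reproduced here. For simplicity the second-phase sample is drawn by simple random sampling, whereas the paper also allows unequal probability sampling.

```python
import math
import random


def m(x1):
    """Placeholder smooth effect of x1 (not the paper's function)."""
    return math.sin(x1 / 25.0)


def v(x2):
    """Placeholder smooth effect of x2 (not the paper's function)."""
    return math.log(x2)


def simulate_case1(n, beta0, beta_z, seed=0):
    """Draw n observations from Y = beta0 + m(x1) + v(x2) + z beta_z + eps."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x1 = rng.uniform(20, 160)
        x2 = rng.uniform(25, 100)
        # Hypothetical logistic response function making z depend on (x1, x2).
        p = 1.0 / (1.0 + math.exp(-(-2.0 + 0.01 * x1 + 0.02 * x2)))
        z = [1 if rng.random() < p else 0 for _ in beta_z]
        eps = rng.gauss(0.0, 1.0)  # N(0, 1) errors as stated for case 1
        y = beta0 + m(x1) + v(x2) + sum(b * zj for b, zj in zip(beta_z, z)) + eps
        rows.append({"y": y, "x1": x1, "x2": x2, "z": z})
    return rows


def mask_second_phase(rows, n2, seed=1):
    """Keep z only for a simple random subsample of size n2 (simplification)."""
    rng = random.Random(seed)
    s2 = set(rng.sample(range(len(rows)), n2))
    return [dict(r, z=r["z"] if i in s2 else None) for i, r in enumerate(rows)]
```

Calling `mask_second_phase(simulate_case1(200, 1.0, [0.5, -0.5, 1.0]), 50)` yields a first-phase sample of 200 records in which z is available for 50 of them and set to `None` elsewhere, mirroring the missing-by-design pattern.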

Table 5 Median of mean squared prediction error for simulated data (for bivariate x)

Table 6 Bias and estimated variance for regression coefficient ${\hat{\beta }}_2$ for simulated data (for bivariate x)

About this article

Kauermann, G., Ali, M. Semi-parametric regression when some (expensive) covariates are missing by design. Stat Papers 62, 1675–1696 (2021). https://doi.org/10.1007/s00362-019-01152-5

Received: 22 January 2019
Revised: 15 December 2019
Published: 01 January 2020
Issue Date: August 2021
DOI: https://doi.org/10.1007/s00362-019-01152-5
