Some dimension reduction strategies for the analysis of survey data
In the era of big data, researchers interested in developing statistical models are challenged with how to achieve parsimony. Usually, some sort of dimension reduction strategy is employed. Classic strategies are often in the form of traditional inference procedures, such as hypothesis testing; however, the increase in computing capabilities has led to the development of more sophisticated methods. In particular, sufficient dimension reduction has emerged as an area of broad and current interest. While these types of dimension reduction strategies have been employed for numerous data problems, they are scantly discussed in the context of analyzing survey data. This paper provides an overview of some classic and modern dimension reduction methods, followed by a discussion of how to use the transformed variables in the context of analyzing survey data. We highlight some of these methods with an analysis of health insurance coverage using the US Census Bureau’s 2015 Planning Database.
Keywords: Big data; Central mean subspace; Flexible models; Official statistics; Principal component analysis; Sufficient dimension reduction
Introduction
The explosion of big data has resulted in both a dramatic increase in the volume of available data and the possibilities of how to use that data. Federal databases, which are based on survey data collected by federal agencies, are key sources of massive datasets and crucial for ongoing research. The need by researchers to analyze not only public-use data, but also restricted-use microdata, is often pivotal for addressing important research questions. The growing demand for access to such data in the United States is highlighted by the establishment of 27 Federal Statistical Research Data Centers, which are partnerships between federal statistical agencies and leading research institutions in the United States.
How big data can be leveraged in the construction of official statistics is a matter of ongoing discussion [1]. However, there are major benefits to how big data from federal databases, nonfederal databases, or both, are used. For example, the Committee on National Statistics assembled the Panel to Review the 2010 Census. The Panel suggested more effective use of Census Bureau databases [2], which is consistent with the Census Bureau’s increasing emphasis on accurate model-based predictions to conduct more efficient and cost-effective surveys [3]. Another potential benefit is to improve mutual government-citizen understanding [4], which in turn could improve the quality of survey data collected for federal databases.
Combining multiple federal databases, such as through record linkage techniques, can help researchers address more refined questions and produce more powerful statistical analyses. However, the resulting massive datasets often require the researcher to develop and apply a sound strategy for handling an inherently high-dimensional problem. Establishing such a strategy is also necessary for the development and dissemination of official statistics, which are based on survey data collected and stored in federal databases. Dimension reduction techniques can be an effective approach for reducing the dimensionality in big data, regardless of its source. However, there is little literature highlighting the efficacy of dimension reduction techniques in the context of analyzing survey data from federal databases. This paper aims to fill this gap.
Lumley and Scott [5] state in the abstract of their paper, “Data from complex surveys are being used increasingly to build the same sort of explanatory and predictive models used in the rest of statistics.” For example, Gelman [6] discussed the broader issue of survey weighting and regression modeling, with an application of building predictive models using the Social Indicators Survey. The analysis also developed a multilevel regression model, for which some of the standard estimation and inference procedures in those models can be applied [7]. Lumley and Scott [5, 8] demonstrated building regression models (in particular, linear and generalized linear models) using data from the National Health and Nutrition Examination Survey (NHANES). They also provided a thorough discussion about testing in such regression models being fit to survey data. Young et al. [9] demonstrated the appropriateness of using zero-inflated regression models for understanding housing unit adds or deletes in the United States based on the Census Bureau’s Master Address File (MAF). In each of the above examples, inference or variable selection procedures can be employed to determine the “best” predictor variables for the respective model. However, choosing the best strategy for selecting from a large number of predictor variables can be challenging, especially in survey data where multicollinearity is almost always an issue. One appealing approach for such settings is to use dimension reduction.
We provide an analysis of data involving health insurance reform in the United States. Health insurance reform is always a major, and oftentimes controversial, social and political topic. One of the most significant efforts in recent years toward health insurance reform in the United States has been the Patient Protection and Affordable Care Act (ACA), also known as “Obamacare.” The ACA became effective in early 2010 with most major provisions phased in by early 2014. The ACA has an individual mandate, which requires each individual to buy insurance or pay a penalty if not covered by an employer-sponsored health plan or other public insurance plan. While not impacted by the provisions in the ACA, some individuals are covered under more than one health insurance plan for various reasons; e.g., supplementing coverage with a secondary plan for services not covered by their primary plan. Our example focuses on building models of health insurance coverage across the United States.
This paper is organized as follows. In “Dimension reduction techniques” section, we provide a review of principal component analysis, sufficient dimension reduction methods, and the associated algorithms. In “Flexible modeling with the transformed data” section, we discuss how dimension reduction methods can be applied to survey data with many variables, and suggest some flexible modeling techniques that could be used with the transformed data. In “Analyzing health insurance coverage using the 2015 planning database” section, we use dimension reduction to analyze health care coverage based on survey data from the block-group-level 2015 Planning Database (PDB), which contains selected 2010 Census and selected 2009–2013 5-year American Community Survey (ACS) estimates. In “Conclusion” section, we discuss some of the conclusions from this work. Finally, in “Summary” section, we summarize what has been presented in this work as well as some general comments about dimension reduction methods.
Dimension reduction techniques
A major use of survey data is in the building of informative predictive models. For our discussion, we consider regression models with a univariate response variable Y and a p-dimensional vector of predictors \(\mathbf X.\) In full generality, the goal of regression is to characterize and infer about the conditional distribution of \(Y \mid \mathbf X.\) When p is large, a researcher is often faced with two major challenges, which are especially relevant to the analysis of survey data. First, the values of the predictors are not controlled at levels as they would be in a properly designed experiment; thus, multicollinearity is often present [10]. Second, it is often desirable to reduce the number of predictor variables such that they are still informative about the response. These challenges can be addressed using the methods we discuss in this section. We first present principal component analysis (PCA), which is a classic and well-known multivariate procedure that can be used as a dimension reduction strategy. We then discuss more modern dimension reduction and sufficient dimension reduction techniques, including sliced inverse regression [11], partial sliced inverse regression [12], sliced average variance estimation [13], and principal Hessian direction [14, 15].
Principal components

First principal component: \(\text{PC}_1 = \mathbf a_1^{\text{T}} \mathbf X,\) where \(\mathbf a_1\) is such that \(\text{Var}(\text{PC}_1) = \mathbf a_1^{\text{T}}\Sigma \mathbf a_1 = \max_{\Vert \mathbf a \Vert = 1} \mathbf a^{\text{T}}\Sigma \mathbf a;\)

Second principal component: \(\text{PC}_2 = \mathbf a_2^{\text{T}} \mathbf X,\) where \(\text{Var}(\text{PC}_2) = \mathbf a_2^{\text{T}}\Sigma \mathbf a_2 = \max_{\Vert \mathbf a \Vert = 1} \mathbf a^{\text{T}}\Sigma \mathbf a\) with \(\mathbf a_1^{\text{T}}\Sigma \mathbf a_2 = 0;\)

\(p{\text{th}}\) principal component: \(\text{PC}_p = \mathbf a_p^{\text{T}} \mathbf X,\) where \(\text{Var}(\text{PC}_p) = \mathbf a_p^{\text{T}}\Sigma \mathbf a_p = \max_{\Vert \mathbf a \Vert = 1} \mathbf a^{\text{T}}\Sigma \mathbf a\) with \(\mathbf a_p^{\text{T}}\Sigma \mathbf a_j = 0,\) \(j=1,\ldots ,p-1.\)
Proposition 1
Let \((\lambda _i, \mathbf {\eta }_i)\) for \(i = 1,...,p\) be the eigenstructure of \(\Sigma, \) where \(\lambda _1 \ge \cdots \ge \lambda _p \ge 0.\) Then the \(i{\text {th}}\) principal component is given by \(\text {PC}_i = \mathbf {\eta }_i^T \mathbf {X},\) where \(1 \le i \le p,\) and \(\text {Var}(\text {PC}_i) = \mathbf {\eta }_i^T \Sigma \mathbf {\eta }_i = \lambda _i,\) \(\text {Cov}(\text {PC}_i, \text {PC}_k) = 0\) for \(i\ne k.\)
We can use, for example, the singular value decomposition of the covariance matrix \(\Sigma \) to find the eigenvalues and eigenvectors. \(\Sigma \) is also referred to as the kernel matrix. For consistency with notation used later, we denote \(M_{PC} = \Sigma \) as the kernel matrix for PCA. This is estimated by \(\hat{M}_{PC} = \hat{\Sigma },\) which is based on the sample data.
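As an illustration of Proposition 1, the following is a minimal sketch in Python with NumPy, assuming a symmetric eigendecomposition of the sample covariance matrix is used in place of a singular value decomposition. The function name `pca_directions` and the synthetic data are hypothetical and are not part of the analysis in this paper.

```python
import numpy as np

def pca_directions(X):
    """Eigen-decompose the sample covariance matrix of X (n rows, p columns)."""
    sigma_hat = np.cov(X, rowvar=False)            # estimate of the kernel matrix M_PC
    eigvals, eigvecs = np.linalg.eigh(sigma_hat)   # eigh returns eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]              # re-sort so that lambda_1 >= ... >= lambda_p
    return eigvals[order], eigvecs[:, order]

rng = np.random.default_rng(0)
# Synthetic, illustrative data only: three correlated predictors.
mixing = np.array([[2.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 0.2]])
X = rng.normal(size=(500, 3)) @ mixing

lam, eta = pca_directions(X)
scores = (X - X.mean(axis=0)) @ eta                # principal component scores
```

In this sketch, the sample variances of the columns of `scores` equal the sorted eigenvalues `lam`, and the sample covariances between different columns are zero up to rounding, which mirrors the statement of the proposition.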
Principal component analysis is, perhaps, the oldest dimension reduction technique that is still widely used today [17, 18]. Consequently, PCA has been applied to numerous important data problems spanning a wide array of scientific fields. For example, PCA has been used for facial image recognition in image analysis [19], for the analysis of hormone profiles to assess the productivity of plants [20], and as part of a robust decision support tool for facilitating industrial production scheduling [21]. PCA has been applied for various survey data analyses, but due to the sometimes large number of binary or categorical variables in such data, it does not always provide reliable results [22].
Sufficient dimension reduction
Formally, a dimension reduction is a function \(\mathcal{R}(\mathbf X)\) that maps \(\mathbf X\) to a k-dimensional subset of the reals such that \(k < p.\) Specifically, we let \(\mathcal{R}(\mathbf X)=\mathbf B^{\text{T}}\mathbf X,\) where \(\mathbf B\) is a \(p\times k\) matrix. We say that a dimension reduction is sufficient if the distribution of \(Y \mid \mathcal{R}(\mathbf X)\) is the same as that of \(Y \mid \mathbf X,\) which is the original conditional distribution of interest in regression models. Combining the notions of dimension reduction and sufficiency, sufficient dimension reduction [11, 15, 23] is used to detect a lower-dimensional subspace of the predictor space such that the response variable is independent of the predictors given their projection onto this subspace; that is, the projection retains all of the information that the predictors carry about the response.
If \(Y \perp \mathbf{X} \mid \mathbf{\eta}^\text{T}\mathbf{X}\) for some \(\mathbf{\eta}\in \mathbb{R}^{p\times d},\) \(d < p,\) then \(\mathbf{X}\) can be replaced by \(\mathbf{\eta}^\text{T}\mathbf{X}\) without loss of information. The subspace spanned by the columns of \(\varvec{\eta}\) is called a dimension reduction subspace for the regression of Y on \(\mathbf{X}.\) The intersection of all dimension reduction subspaces is called the central subspace (CS), which we denote by \(\mathcal{S}_{Y \mid \mathbf{X}}\) with dimension \(d = \text{dim}(\mathcal{S}_{Y \mid \mathbf{X}}).\) The basis \(\varvec{\beta}\in \mathbb{R}^{p\times d},\) \(d<p,\) of the CS has the property that \(Y \perp \mathbf{X} \mid \mathbf{\beta}^{T}\mathbf{X},\) which is to say that the conditional distribution of \(Y \mid \mathbf{X}\) is the same as the conditional distribution of \(Y \mid \varvec{\beta}^{T}\mathbf{X}.\) Under mild conditions [23], the CS exists and is unique.
Sometimes, the mean function \(\text{E}(Y \mid \mathbf{X})\) may be of primary interest instead of the conditional distribution of \(Y \mid \mathbf{X}.\) For such settings, Cook and Li [24] introduced the following. Let \(\varvec{\beta}\in \mathbb{R}^{p\times d},\) \(d<p,\) now be a basis for a subspace such that \(Y \perp \text{E}(Y \mid \mathbf X) \mid \mathbf{\beta}^{T}\mathbf{X}.\) This subspace is then called a mean dimension reduction subspace. The intersection of all mean dimension reduction subspaces is called the central mean subspace (CMS), which we denote by \(\mathcal{S}_{\text{E}(Y \mid \mathbf{X})}.\)
1. Linearity condition: \(\text{E}(\mathbf X \mid \mathbf P_{\mathcal{S}} \mathbf X)\) is a linear function of \(\mathbf X.\)
2. Constant variance-covariance matrix condition: \(\text{Var}(\mathbf X \mid \mathbf P_{\mathcal{S}} \mathbf X)\) is a nonrandom matrix.
It is important to emphasize the fundamental difference between PCA and sufficient dimension reduction. PCA reduces the number of predictors without considering the response variables, and choosing the number of principal components is not done through any formal inference paradigm. However, the idea of sufficient dimension reduction is to attain a sufficient subspace, which includes all of the information we need. There are asymptotic results for determining the number of dimensions in sufficient dimension reduction. These asymptotic results are derived and/or discussed in the references we cite for the dimension reduction techniques that we discuss below. Thus, using PCA is somewhat limited because it does not consider the response variable(s), nor does it have a formal inference mechanism for choosing the “best” number of principal components.
Sliced inverse regression
\(\text{E}(Y \mid \mathbf X)\) is a p-dimensional surface where, for now, we assume that all of the variables represented by the columns of \(\mathbf X\) are continuous. The notion of inverse regression works with the curve computed by \(\text{E}(\mathbf X \mid Y),\) which consists of p one-dimensional regressions. Li [11] introduced sliced inverse regression (SIR), which involves dividing the range of the response Y into H nonoverlapping intervals called slices. Letting \(\Sigma = \text{Var}(\mathbf{X})\) and \(\mathbf{Z} = \Sigma^{-1/2} (\mathbf{X} - \text{E}(\mathbf{X})),\) we then see that \(\mathcal{S}_{Y \mid \mathbf{X}} = \Sigma^{-1/2}\mathcal{S}_{Y \mid \mathbf{Z}}\) [25]. Hence, we can work on the scale of \(\mathbf{Z}.\) Moreover, under the linearity condition, \(\mathcal{S}_{\text{E}(\mathbf{Z} \mid Y)} \subset \mathcal{S}_{Y \mid \mathbf{Z}}\) and \(\mathcal{S}\{\text{Var}[\text{E}(\mathbf{Z} \mid Y)]\} = \mathcal{S}_{\text{E}(\mathbf{Z} \mid Y)}\) [25]. Thus, we can form the kernel matrix \(M_{SIR} = \text{Var}[\text{E}(\mathbf{Z} \mid Y)].\)
1. For \(i=1,\ldots ,n,\) standardize \(\mathbf{x}_i\) into \(\mathbf{z}_i,\) and divide \(y_i\) into H slices. Let \(\hat{f}_h\) be the proportion of the \(y_i\) in slice h, \(h = 1,\ldots ,H.\)
2. Compute the sample mean of \(\mathbf{z}\) in each slice, and denote these by \(\bar{\mathbf{z}}_1,\ldots ,\bar{\mathbf{z}}_H.\)
3. Form the weighted variance-covariance matrix \(\hat{M}_{SIR} = \sum^{H}_{h = 1} \hat{f}_h\bar{\mathbf{z}}_h \bar{\mathbf{z}}^\text{T}_h.\)
4. Find the eigenstructure of \(\hat{M}_{SIR}{:}\) \((\lambda_i,\mathbf{\eta}_i),\) \(i=1,\ldots ,p.\) The \(d<p\) eigenvectors corresponding to the d largest eigenvalues are the estimated directions of \(\mathcal{S}_{\text{E}(\mathbf{Z} \mid Y)}.\) Then, we transform back to the original \(\mathbf{X}\) scale by calculating \(\hat{\mathbf{\beta}}_l = \hat{\Sigma}^{-1/2} \hat{\mathbf{\eta}}_l,\) \(l=1,\ldots ,d.\)
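The four steps above can be sketched in Python with NumPy as follows. This is a minimal illustration under stated assumptions, not a reference implementation: slices are formed from the ordered responses so that they have nearly equal counts, the helper names `inv_sqrt` and `sir` are hypothetical, and the data are simulated with a single true direction.

```python
import numpy as np

def inv_sqrt(S):
    """Symmetric inverse square root of a positive definite matrix."""
    w, V = np.linalg.eigh(S)
    return V @ np.diag(w ** -0.5) @ V.T

def sir(X, y, H=10, d=1):
    """Sliced inverse regression: returns a p x d matrix of estimated directions."""
    n, p = X.shape
    S_inv_half = inv_sqrt(np.cov(X, rowvar=False))
    Z = (X - X.mean(axis=0)) @ S_inv_half               # step 1: standardize
    M = np.zeros((p, p))
    for idx in np.array_split(np.argsort(y), H):        # step 1: H slices of the ordered y
        z_bar = Z[idx].mean(axis=0)                     # step 2: slice mean
        M += (len(idx) / n) * np.outer(z_bar, z_bar)    # step 3: weighted kernel matrix
    vals, vecs = np.linalg.eigh(M)                      # step 4: eigenstructure
    eta = vecs[:, np.argsort(vals)[::-1][:d]]           # d leading eigenvectors
    return S_inv_half @ eta                             # back to the original X scale

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 4))
b = np.array([1.0, 1.0, 0.0, 0.0]) / np.sqrt(2.0)       # true direction (simulation only)
y = (X @ b) ** 3 + 0.1 * rng.normal(size=2000)

beta_hat = sir(X, y, H=10, d=1)[:, 0]
alignment = abs(beta_hat @ b) / np.linalg.norm(beta_hat)  # |cosine| with the true direction
```

Directions are identified only up to sign and scale, which is why the sketch compares the estimate with the true direction through the absolute cosine.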
Partial sliced inverse regression
We next consider the case where \(\mathbf X\) can consist of both continuous and categorical predictor variables. In order to accommodate this in a setup similar to SIR, we need to use the notion of partial dimension reduction as introduced in Chiaramonte et al. [12]. Let W be a categorical variable with K levels and define the partial central subspace relative to \(\mathbf{X}\) as the intersection of all subspaces spanned by \(\varvec{\eta}\in \mathbb{R}^{p\times d}\) such that \(Y\perp \mathbf{X} \mid (\varvec{\eta}^T \mathbf{X}, W).\) Denote the partial central subspace as \(\mathcal{S}^{W}_{Y \mid \mathbf{X}}.\) The relationship between partial and conditional dimension reduction is \(\mathcal{S}^W_{Y \mid \mathbf{X}} = \bigoplus^K_{k=1} \mathcal{S}_{Y_k \mid \mathbf{X}_k},\) where \(\mathcal{S}_{Y_k \mid \mathbf{X}_k}\) is the CS conditioned on level k and \(\bigoplus\) is the direct sum.
For each level, the mean and covariance matrix of \(\mathbf{X}_k\) are \(\varvec{\mu}_k\) and \(\Sigma_k,\) respectively. We further assume that the covariance structures are the same across the levels; i.e., \(\Sigma_k = \Sigma_{pool},\) \(k= 1,\ldots , K.\) Now, letting \(\mathbf{Z}_k = \Sigma^{-1/2}_{pool}(\mathbf{X}_k - \varvec{\mu}_k)\) results in \(\mathcal{S}^W_{Y \mid \mathbf{X}} = \Sigma^{-1/2}_{pool}\bigoplus^K_{k=1} \mathcal{S}_{Y_k \mid \mathbf{Z}_k}.\) Then, we can use SIR for each level to find the kernel matrix \(M_k.\) After averaging these kernel matrices over the different levels, we get \(M^W = \sum^{K}_{k=1} \text{Pr}(W=k) M_k.\)
1. For each level k, \(k=1,\ldots ,K,\) calculate \(\bar{\mathbf{x}}_k\) and \(\hat{\Sigma}_k,\) which are the sample mean and sample variance-covariance matrix of \(\mathbf{X}_k,\) respectively. Moreover, calculate the common sample variance-covariance matrix \(\hat{\Sigma}_{pool} = \sum^K_{k=1}\frac{n_k}{n}\hat{\Sigma}_k\) and \(\mathbf{z}_{ik} = \hat{\Sigma}^{-1/2}_{pool}(\mathbf{x}_{ik} - \bar{\mathbf{x}}_{k}),\) \(i=1,\ldots ,n_k.\)
2. Apply the steps in the SIR algorithm to get the sample kernel matrix in each level k: \(\hat{M}_k.\) Then \(\hat{M}^W = \sum^K_{k=1} \frac{n_k}{n}\hat{M}_k.\)
3. The first d eigenvectors, \(\hat{\mathbf{\eta}}_1,\ldots ,\hat{\mathbf{\eta}}_d,\) of \(\hat{M}^W\) correspond to the d largest eigenvalues \(\hat{\lambda}_1 \ge \hat{\lambda}_2 \ge \cdots \ge \hat{\lambda}_{d}.\) These eigenvectors are the estimated directions of \(\mathcal{S}_{Y \mid \mathbf{Z}}.\) Then, transform back to the original \(\mathbf{X}\) scale by calculating \(\hat{\mathbf{\beta}}_l = \hat{\Sigma}^{-1/2}_{pool} \hat{\mathbf{\eta}}_l,\) \(l=1,\ldots ,d.\)
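A minimal Python sketch of the partial SIR steps above is given next, assuming NumPy, equal-count slices within each level, and simulated data in which the categorical variable only shifts the intercept. The names `sir_kernel` and `partial_sir` are hypothetical.

```python
import numpy as np

def inv_sqrt(S):
    """Symmetric inverse square root of a positive definite matrix."""
    w, V = np.linalg.eigh(S)
    return V @ np.diag(w ** -0.5) @ V.T

def sir_kernel(Z, y, H):
    """SIR kernel matrix for already standardized predictors Z."""
    n, p = Z.shape
    M = np.zeros((p, p))
    for idx in np.array_split(np.argsort(y), H):
        z_bar = Z[idx].mean(axis=0)
        M += (len(idx) / n) * np.outer(z_bar, z_bar)
    return M

def partial_sir(X, y, w, H=5, d=1):
    """Partial SIR with categorical variable w; returns a p x d matrix of directions."""
    n, p = X.shape
    levels = np.unique(w)
    # Step 1: pooled covariance, weighting each level by n_k / n.
    S_pool = sum((np.sum(w == k) / n) * np.cov(X[w == k], rowvar=False) for k in levels)
    S_inv_half = inv_sqrt(S_pool)
    M_w = np.zeros((p, p))
    for k in levels:
        mask = w == k
        Z_k = (X[mask] - X[mask].mean(axis=0)) @ S_inv_half   # standardize within level k
        M_w += (mask.sum() / n) * sir_kernel(Z_k, y[mask], H)  # step 2: average the kernels
    vals, vecs = np.linalg.eigh(M_w)                           # step 3: eigenstructure
    eta = vecs[:, np.argsort(vals)[::-1][:d]]
    return S_inv_half @ eta

rng = np.random.default_rng(2)
n = 1500
w = rng.integers(0, 3, size=n)                                 # three levels (simulation only)
X = rng.normal(size=(n, 4))
b = np.array([1.0, -1.0, 0.0, 0.0]) / np.sqrt(2.0)
y = X @ b + 2.0 * w + 0.2 * rng.normal(size=n)

beta_hat = partial_sir(X, y, w, H=5, d=1)[:, 0]
alignment = abs(beta_hat @ b) / np.linalg.norm(beta_hat)
```

Slicing is done separately within each level, so a level-specific shift in the response does not contaminate the slice means.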
Sliced average variance estimate
One disadvantage of SIR is that it cannot detect symmetric structure of predictors; however, the sliced average variance estimate (SAVE) method [13] can find the directions, even in the presence of symmetric structures. Under the linearity and constant variance conditions, \(\mathbf I_p - \text{Var}(\mathbf{Z} \mid Y) \in \mathcal{S}_{Y \mid \mathbf{Z}},\) where \(\mathbf I_p\) is the \(p\times p\) identity matrix. Hence, \(M_{SAVE} = \text{E}[\mathbf I_p - \text{Var}(\mathbf{Z} \mid Y)]^2.\)
1. For \(i = 1,\ldots ,n,\) standardize \(\mathbf{x}_i\) into \(\mathbf{z}_i\) and divide \(y_i\) into H slices. Let \(\hat{f}_h\) be the proportion of the \(y_i\) in slice h, \(h = 1,\ldots ,H.\)
2. Compute the sample covariance of \(\mathbf{z}\) in each slice, \(\widehat{\text{Var}}(\mathbf{Z} \mid Y = h).\)
3. Form the weighted covariance matrix \(\hat{M}_{SAVE} = \sum^{H}_{h = 1} \hat{f}_h [\mathbf I_p - \widehat{\text{Var}}(\mathbf{Z} \mid Y = h)]^2.\)
4. Find the eigenstructure of \(\hat{M}_{SAVE}\) and take the first d eigenvectors, \(\hat{\mathbf{\eta}}_1,\ldots ,\hat{\mathbf{\eta}}_d,\) which correspond to the d largest eigenvalues \(\hat{\lambda}_1 \ge \hat{\lambda}_2 \ge \cdots \ge \hat{\lambda}_{d}.\) These eigenvectors are the estimated directions of \(\mathcal{S}_{\text{Var}(\mathbf{Z} \mid Y)}.\) Then, transform back to the original \(\mathbf{X}\) scale by calculating \(\hat{\mathbf{\beta}}_l = \hat{\Sigma}^{-1/2} \hat{\mathbf{\eta}}_l,\) \(l=1,\ldots ,d.\)
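The SAVE steps above can be sketched in the same way. The example below is only an illustration under assumptions: it uses NumPy, equal-count slices, and a simulated response that is symmetric in the true direction, which is the setting where SIR is expected to struggle. The names `inv_sqrt` and `save_directions` are hypothetical.

```python
import numpy as np

def inv_sqrt(S):
    """Symmetric inverse square root of a positive definite matrix."""
    w, V = np.linalg.eigh(S)
    return V @ np.diag(w ** -0.5) @ V.T

def save_directions(X, y, H=5, d=1):
    """Sliced average variance estimate: returns a p x d matrix of directions."""
    n, p = X.shape
    S_inv_half = inv_sqrt(np.cov(X, rowvar=False))
    Z = (X - X.mean(axis=0)) @ S_inv_half                 # step 1: standardize
    M = np.zeros((p, p))
    for idx in np.array_split(np.argsort(y), H):          # step 1: H slices of the ordered y
        A = np.eye(p) - np.cov(Z[idx], rowvar=False)      # step 2: I_p minus slice covariance
        M += (len(idx) / n) * (A @ A)                     # step 3: weighted sum of squares
    vals, vecs = np.linalg.eigh(M)                        # step 4: eigenstructure
    eta = vecs[:, np.argsort(vals)[::-1][:d]]
    return S_inv_half @ eta                               # back to the original X scale

rng = np.random.default_rng(3)
X = rng.normal(size=(4000, 4))
b = np.array([1.0, 1.0, 0.0, 0.0]) / np.sqrt(2.0)         # true direction (simulation only)
y = (X @ b) ** 2 + 0.1 * rng.normal(size=4000)            # symmetric in the direction b

beta_hat = save_directions(X, y, H=5, d=1)[:, 0]
alignment = abs(beta_hat @ b) / np.linalg.norm(beta_hat)
```

Because each slice requires a covariance estimate rather than a mean, this sketch uses fewer slices and a larger sample than the SIR illustration would need.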
Principal Hessian directions
1. Y-based PHDs: under the linearity and constant variance conditions, PHDs based on the response yield the kernel matrix \(M_{yPHD} = \text{E}\{[Y - \text{E}(Y)]\mathbf{Z}\mathbf{Z}^T\} \subset \mathcal{S}_{Y \mid \mathbf{Z}}.\)
2. Residual-based PHDs: PHDs based on the residuals yield the kernel matrix \(M_{rPHD} = \text{E}(\epsilon \mathbf{Z}\mathbf{Z}^T),\) where \(\epsilon = Y - \text{E}(Y) - \mathbf{\beta}^\text{T} \mathbf{Z}\) and \(\mathbf{\beta} = \text{Cov}(\mathbf{Z}, Y).\)
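The two kernel matrices above can be estimated by replacing expectations with sample averages. The following Python sketch assumes NumPy and simulated data; the function name `phd_kernels` is hypothetical, and ranking eigenvectors by the absolute value of their eigenvalues is a common convention that is assumed here rather than taken from the text.

```python
import numpy as np

def inv_sqrt(S):
    """Symmetric inverse square root of a positive definite matrix."""
    w, V = np.linalg.eigh(S)
    return V @ np.diag(w ** -0.5) @ V.T

def phd_kernels(X, y):
    """Sample versions of the y-based and residual-based PHD kernel matrices."""
    n = len(y)
    S_inv_half = inv_sqrt(np.cov(X, rowvar=False))
    Z = (X - X.mean(axis=0)) @ S_inv_half
    e_y = y - y.mean()                               # Y - E(Y)
    M_y = (Z * e_y[:, None]).T @ Z / n               # E{[Y - E(Y)] Z Z^T}
    beta = Z.T @ e_y / n                             # Cov(Z, Y)
    resid = e_y - Z @ beta                           # epsilon = Y - E(Y) - beta^T Z
    M_r = (Z * resid[:, None]).T @ Z / n             # E(epsilon Z Z^T)
    return M_y, M_r, S_inv_half

rng = np.random.default_rng(4)
X = rng.normal(size=(3000, 4))
b = np.array([1.0, 1.0, 0.0, 0.0]) / np.sqrt(2.0)    # true direction (simulation only)
y = (X @ b) ** 2 + 0.1 * rng.normal(size=3000)

M_y, M_r, S_inv_half = phd_kernels(X, y)
vals, vecs = np.linalg.eigh(M_y)
lead = vecs[:, np.argmax(np.abs(vals))]              # assumed convention: largest |eigenvalue|
beta_hat = S_inv_half @ lead                         # back to the original X scale
alignment = abs(beta_hat @ b) / np.linalg.norm(beta_hat)
```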
PHD has also been used for other complex data problems. For example, Cheng and Li [34] demonstrated the efficacy of using PHD in designed experiments having a large number of factors, with particular attention given to factorial designs and rotatable response surface designs. Lue [35] used PHD in the context of a regression analysis when the predictors are known to have measurement error. Lue et al. [36] showed how an imputed-spline modification to PHD yields an effective framework for conducting dimension reduction in survival regressions with censored data.
Flexible modeling with the transformed data
When building a regression model for relating a response to a large number of predictors, researchers often try fitting a multiple linear regression model first. Then, residual diagnostics are assessed to identify potential outliers, high leverage values, and overall goodness of fit. However, a multiple linear regression model is often too restrictive in practice, especially when using survey data from federal databases. Greater flexibility can be achieved using semiparametric regression models, like spline regression, generalized additive models, or partial linear models; see the texts by Ruppert et al. [37] and Härdle et al. [38] for thorough treatments of semiparametric regression modeling. The appropriateness of using such flexible models in big data settings has also been discussed in Oswald and Putka [39] and Young et al. [40].
Flexible models have been used for a wide range of analyses involving survey data from federal databases. For example, Rogers et al. [41] used cubic splines to develop migration models based on data from the ACS. Kniesner and Li [42] developed male labor supply functions using local linear kernel regression based on panel data from the Survey of Income and Program Participation (SIPP). Gronniger [43] developed a partial linear model relating mortality to body mass index and other health measures using the data from the National Health Interview Survey (NHIS).
Each of the examples just highlighted had a large number of candidate predictor variables available from the respective survey. Many additional variables from these surveys could have been investigated by the authors for their respective model. By employing one of the dimension reduction methods discussed in “Dimension reduction techniques” section, one could develop a model of the response variable Y as a function of the d transformed predictor variables, \(X_l^*=\hat{\mathbf {\beta }}_l^{\text {T}}{} \mathbf X,\) \(l=1,\ldots ,d.\) Then, the estimated model could have better predictive ability. One example for developing such models is principal components regression [44], which involves estimating a multiple linear regression model for the relationship between Y and the \(X_l^*\)s, which were determined using PCA. While using a multiple linear regression model in this setup is conceptually appealing, use of visualizations may suggest the need for greater flexibility in the model. Pairwise scatterplots of Y versus each of \(X_1^*,\ldots ,X^*_d\) might reveal curvature or complex nonlinearities in the relationship between some of the variables, which would suggest the need for a semiparametric regression model.
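To make principal components regression concrete, here is a minimal Python sketch, assuming NumPy, ordinary (unweighted) least squares, and simulated data. The function name `pcr_fit` is hypothetical. When all p components are retained, the transformed predictors are a rotation of the centered originals, so the fitted values coincide with those from ordinary least squares on the original predictors; retaining fewer components trades some fit for a smaller model.

```python
import numpy as np

def pcr_fit(X, y, d):
    """Regress y on the first d principal component scores of X."""
    x_bar = X.mean(axis=0)
    vals, vecs = np.linalg.eigh(np.cov(X, rowvar=False))
    B = vecs[:, np.argsort(vals)[::-1][:d]]              # p x d matrix of directions
    X_star = (X - x_bar) @ B                             # transformed predictors X*_1, ..., X*_d
    design = np.column_stack([np.ones(len(y)), X_star])  # intercept plus d scores
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return B, coef, design @ coef

rng = np.random.default_rng(5)
n, p = 300, 5
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))    # correlated predictors (simulation only)
y = X @ np.array([1.0, 0.5, 0.0, -0.5, 2.0]) + rng.normal(size=n)

_, _, fitted_2 = pcr_fit(X, y, d=2)                      # reduced model
_, _, fitted_full = pcr_fit(X, y, d=p)                   # all components retained

ols_design = np.column_stack([np.ones(n), X])
ols_coef, *_ = np.linalg.lstsq(ols_design, y, rcond=None)
ols_fitted = ols_design @ ols_coef

rss_2 = float(np.sum((y - fitted_2) ** 2))
rss_full = float(np.sum((y - fitted_full) ** 2))
```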
The above framework is also possible when the data are from complex surveys, where population members are not sampled with equal probability. Determining appropriate survey weights is independent of the flexible modeling strategy employed with the transformed variables. Survey weights can be obtained through traditional approaches, like poststratification and raking, or through more advanced procedures, like the flexible model-based alternatives proposed in Elliott and Little [45]. These can then be incorporated in a weighted version of the chosen semiparametric regression model, which will usually require solving a survey-weighted least squares problem [46] or implementing something like a survey-weighted backfitting algorithm [47].
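For a linear model, the survey-weighted least squares problem has the closed-form solution \((\mathbf D^{\text{T}}\mathbf W\mathbf D)^{-1}\mathbf D^{\text{T}}\mathbf W\mathbf y,\) where \(\mathbf D\) is the design matrix and \(\mathbf W\) is the diagonal matrix of weights. The Python sketch below, assuming NumPy and hypothetical integer weights, computes only the point estimates; it does not address design-based variance estimation, which requires information about the sampling design. The function name `weighted_least_squares` is hypothetical.

```python
import numpy as np

def weighted_least_squares(design, y, weights):
    """Minimize sum_i w_i * (y_i - design_i @ coef)^2 over coef."""
    sw = np.sqrt(weights)
    # Scaling each row by sqrt(w_i) turns the weighted problem into an ordinary one.
    coef, *_ = np.linalg.lstsq(design * sw[:, None], y * sw, rcond=None)
    return coef

rng = np.random.default_rng(6)
n = 40
x_star = rng.normal(size=n)                               # one transformed predictor (simulated)
design = np.column_stack([np.ones(n), x_star])
y = 1.0 + 2.0 * x_star + rng.normal(size=n)
weights = rng.integers(1, 4, size=n).astype(float)        # hypothetical survey weights

coef_weighted = weighted_least_squares(design, y, weights)

# With integer weights, the weighted fit matches an unweighted fit
# in which unit i appears w_i times.
rep = np.repeat(np.arange(n), weights.astype(int))
coef_replicated, *_ = np.linalg.lstsq(design[rep], y[rep], rcond=None)
```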
Analyzing health insurance coverage using the 2015 planning database
Data
For our analysis, we use the 2015 Planning Database (PDB) [48], a publicly available Census Bureau dataset that contains housing, demographic, socioeconomic, and Census operational data. The variables and counts in the PDB are from the 2010 Census and select 5-year estimates from the 2009–2013 ACS. The data are aggregated at the block-group level. A census block is the smallest geographic unit used by the Census Bureau, and a block group comprises multiple blocks, usually containing between 600 and 3000 people. The PDB comprises approximately 220,000 block groups.
Three separate response variables are investigated for our analysis: the number of people with no health insurance coverage \((Y_1),\) the number of people with one type of health insurance coverage \((Y_2),\) and the number of people with two or more types of health insurance coverage \((Y_3)\). While these could be treated as a multivariate response, we will analyze three separate models to be consistent with the dimension reduction procedures in “Dimension reduction techniques” section, which were developed assuming a univariate response. A total of 15 variables were identified as relevant candidate predictor variables. The descriptions from the PDB documentation for these variables are given in the Additional files 1, 2.
There are a total of 220,354 records in the 2015 PDB for potential analysis. We first excluded observations from the Commonwealth of Puerto Rico, which is often done due to different laws and demographic considerations involving the Commonwealth; see “Dimension reduction techniques” section of Young et al. [9] for an example of excluding Puerto Rico. The number of Puerto Rico records is 2594, which is about 1.18% of the total number of 2015 PDB records. We then omitted records that had missing values for any of the variables under consideration. There are 8754 such records, which is about 3.98% of the total number of 2015 PDB records. This left us with 209,006 records for our analysis. We then transformed the predictors using the maximum likelihood approach of Box and Cox [49] in order to ensure that the linearity condition for dimension reduction is satisfied.
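The Box–Cox step can be illustrated for a single positive predictor. The Python sketch below assumes NumPy, a univariate normal working model for the transformed variable, and a simple grid search over the transformation parameter; it is not necessarily the procedure used for the PDB predictors, and the function names are hypothetical.

```python
import numpy as np

def boxcox_transform(x, lam):
    """Box-Cox power transformation of a positive variable."""
    return np.log(x) if abs(lam) < 1e-12 else (x ** lam - 1.0) / lam

def boxcox_profile_loglik(x, lam):
    """Profile log-likelihood of lam under a normal model for the transformed data."""
    n = len(x)
    xt = boxcox_transform(x, lam)
    # xt.var() is the maximum likelihood variance estimate (divisor n).
    return -0.5 * n * np.log(xt.var()) + (lam - 1.0) * np.log(x).sum()

def boxcox_mle(x, grid=None):
    """Grid-search maximizer of the profile log-likelihood."""
    if grid is None:
        grid = np.linspace(-2.0, 2.0, 401)
    loglik = [boxcox_profile_loglik(x, lam) for lam in grid]
    return float(grid[int(np.argmax(loglik))])

rng = np.random.default_rng(7)
x = rng.lognormal(mean=0.0, sigma=1.0, size=2000)   # simulated: log(x) is exactly normal
lam_hat = boxcox_mle(x)                             # should land near 0 (the log transform)
x_transformed = boxcox_transform(x, lam_hat)
```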
Analysis
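Table 1: Variance inflation factors for the 15 predictor variables
| Variable | VIF | Variable | VIF | Variable | VIF | Variable | VIF |
|---|---|---|---|---|---|---|---|
| \(X_1\) | 4.7301 | \(X_2\) | 2.5689 | \(X_3\) | 7.5692 | \(X_4\) | 16.7708 |
| \(X_5\) | 14.5877 | \(X_6\) | 2.0651 | \(X_7\) | 6.6821 | \(X_8\) | 5.6688 |
| \(X_9\) | 5.7848 | \(X_{10}\) | 10.4067 | \(X_{11}\) | 3.8511 | \(X_{12}\) | 2.0008 |
| \(X_{13}\) | 2.1802 | \(X_{14}\) | 10.3657 | \(X_{15}\) | 11.1770 | | |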
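Table 2: Cumulative proportions of variability explained using the PCA results
| Component | Cumulative proportion | Component | Cumulative proportion | Component | Cumulative proportion | Component | Cumulative proportion |
|---|---|---|---|---|---|---|---|
| PC1 | 0.3355 | PC2 | 0.6055 | PC3 | 0.7691 | PC4 | 0.8422 |
| PC5 | 0.8750 | PC6 | 0.9055 | PC7 | 0.9294 | PC8 | 0.9456 |
| PC9 | 0.9599 | PC10 | 0.9717 | PC11 | 0.9828 | PC12 | 0.9907 |
| PC13 | 0.9948 | PC14 | 0.9978 | PC15 | 1.0000 | | |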
We use PCA to characterize those principal components explaining the most variation in the dataset. While Johnson and Wichern [16] state that there is “no definitive answer” to determine “how many components to retain,” we proceed to use a scree plot. The scree plot consists of the principal components ordered according to their amount of variability explained on the x-axis and the cumulative proportion of the variability explained on the y-axis. The scree plot for the health insurance data is given in Fig. 1. We use 0.90 as the threshold to determine the number of principal components to select. Using this criterion, we select six principal components, which will be used for comparison with the subsequent analysis. These cumulative proportions are also reported in Table 2.
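The threshold rule can be checked directly against the cumulative proportions reported in Table 2. The short Python sketch below assumes NumPy; the variable names are illustrative.

```python
import numpy as np

# Cumulative proportions of variability explained, PC1 through PC15 (Table 2).
cumulative = np.array([0.3355, 0.6055, 0.7691, 0.8422, 0.8750,
                       0.9055, 0.9294, 0.9456, 0.9599, 0.9717,
                       0.9828, 0.9907, 0.9948, 0.9978, 1.0000])
threshold = 0.90

# Index of the first component whose cumulative proportion reaches the threshold;
# adding 1 converts the zero-based index into a component count.
n_components = int(np.argmax(cumulative >= threshold)) + 1

# Proportion explained by each individual component.
individual = np.diff(cumulative, prepend=0.0)
```

Here `n_components` evaluates to 6, since the first five components explain 0.8750 of the variability and the sixth raises the total to 0.9055.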
Table 3: Dimensions chosen by the marginal tests for each of the five sufficient dimension reduction methods
| Method | \(Y_1\) | \(Y_2\) | \(Y_3\) |
|---|---|---|---|
| SIR | 5 | 6 | 3 |
| Partial SIR | 12 | 12 | 11 |
| SAVE | 15 | 14 | 15 |
| yPHD | 14 | 15 | 14 |
| rPHD | 12 | 15 | 14 |
For each of the sufficient dimension reduction procedures, testing is done to determine the dimensions. These marginal tests, based on the work in Cook [52] and Shao et al. [53], are also available in R. The tests are done sequentially, where we first test 0 dimensions versus 1 dimension, 1 dimension versus 2 dimensions, etc. Based on these tests, the dimensions selected for each of the sufficient dimension reduction procedures are summarized in Table 3. The full test results are given in the Additional files 1, 2.
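The sequential logic can be sketched as a simple stopping rule. The Python example below assumes that the m-th test has the null hypothesis that the dimension equals m, that the p-values have already been computed by the chosen method, and that a significance level of 0.05 is used; the function name and the p-values are hypothetical and are not results from this analysis.

```python
def select_dimension(p_values, alpha=0.05):
    """Return the smallest m whose test of 'dimension = m' is not rejected.

    p_values[m] is the p-value for testing m dimensions against more than m.
    If every test is rejected, the number of tests performed is returned.
    """
    for m, p_value in enumerate(p_values):
        if p_value > alpha:
            return m
    return len(p_values)

# Hypothetical p-values for the tests of 0, 1, 2, and 3 dimensions.
example_p_values = [0.001, 0.010, 0.200, 0.600]
chosen = select_dimension(example_p_values)
```

With these illustrative p-values, the tests of 0 and 1 dimensions are rejected and the test of 2 dimensions is not, so the rule stops at 2.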
SAVE, yPHD, and rPHD did not reduce the number of dimensions much or at all. Recall that partial SIR can be used when including categorical variables. A categorical variable was constructed where we partitioned the 50 states and the District of Columbia using the nine Census-designated geographical divisions [54]. The inclusion of this categorical predictor only yielded a moderate reduction according to partial SIR. The only sufficient dimension reduction procedure that noticeably reduces the dimension for each of the three responses is SIR. Therefore, the remainder of our analysis will focus on the results from PCA and SIR.

When the response is the number of people with no insurance (\(Y_1\)), the smoothing term corresponding to the fourth dimension is not significant, with an approximate p-value of 0.102.

When the response is the number of people with one type of insurance (\(Y_2\)), the smoothing term corresponding to the sixth dimension is not significant, with an approximate p-value of 0.221.

When the response is the number of people with two or more types of insurance (\(Y_3\)), all of the smoothing terms are significant.
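Table 4: BIC and adjusted \(R^2\) values for each of the additive model fits using the transformed predictors from PCA and SIR
| Method | Criterion | \(Y_1\) | \(Y_2\) | \(Y_3\) |
|---|---|---|---|---|
| PCA | BIC | 2695444 | 2906673 | 2564298 |
| PCA | Adjusted \(R^2\) | 0.494 | 0.847 | 0.495 |
| SIR | BIC | 2673141 | 2862496 | 2527463 |
| SIR | Adjusted \(R^2\) | 0.545 | 0.876 | 0.577 |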
We next calculated the Bayesian information criterion (BIC) and adjusted \(R^2\) values to compare the estimated additive models for each response. These results are given in Table 4. For each of the three responses, SIR yields the better BIC and adjusted \(R^2\) values. While these measures do not provide direct comparisons between the models based on the different responses, it is worth noting that the adjusted \(R^2\) values for the models with one type of insurance as the response (\(Y_2\)) are quite high relative to the other models. This indicates that there is little improvement that could be made to those estimated models by adding another set of PCA-transformed or SIR-transformed predictors.
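For reference, both criteria can be computed from the observed and fitted values of a model. The Python sketch below assumes NumPy, a Gaussian working likelihood, and a known count of mean parameters (including the intercept); for an additive model the effective degrees of freedom of the smooth terms would typically take the place of this count, and software packages may differ in which constants they include. The function names and the toy data are hypothetical.

```python
import numpy as np

def gaussian_bic(y, fitted, n_params):
    """BIC = -2 * (maximized Gaussian log-likelihood) + n_params * log(n)."""
    n = len(y)
    rss = float(np.sum((y - fitted) ** 2))
    neg2_loglik = n * (np.log(2.0 * np.pi * rss / n) + 1.0)
    return neg2_loglik + n_params * np.log(n)

def adjusted_r2(y, fitted, n_params):
    """Adjusted R^2, where n_params counts the mean parameters including the intercept."""
    n = len(y)
    rss = float(np.sum((y - fitted) ** 2))
    tss = float(np.sum((y - np.mean(y)) ** 2))
    return 1.0 - (rss / (n - n_params)) / (tss / (n - 1))

# Toy data: residual sum of squares 1, total sum of squares 5.
y = np.array([1.0, 2.0, 3.0, 4.0])
fitted = np.array([1.5, 1.5, 3.5, 3.5])

bic = gaussian_bic(y, fitted, n_params=2)
adj = adjusted_r2(y, fitted, n_params=2)   # 1 - (1/2) / (5/3) = 0.7
```

A smaller BIC and a larger adjusted \(R^2\) both favor a model, which is the sense in which the criteria are compared in Table 4.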
Finally, we also assess the residuals from the additive models at the state level. Figure 4 provides maps of the United States, where the states have been shaded according to the mean of the residuals from the respective additive model built using the PCA-transformed predictors (maps in the left column) and the SIR-transformed predictors (maps in the right column). The three rows of maps correspond to those models for individuals with no insurance (\(Y_1\)), with only one type of insurance (\(Y_2\)), and with more than one type of insurance (\(Y_3\)). Notice that each pair of maps for a given response (i.e., the maps within each row) shows similar distributions of the mean residuals at the state level. In particular, the maps corresponding to the additive models for \(Y_1\) (Fig. 4a, b) both show the same states with larger positive residuals, which have darker shading. These states include Nevada, Texas, Florida, and Alaska. The maps corresponding to the additive models for \(Y_2\) (Fig. 4c, d) both show that regions with larger negative residuals (lighter shading) appear mostly in the Western states while regions with larger positive residuals (darker shading) appear mostly in the Midwest. Finally, the maps corresponding to the additive models for \(Y_3\) (Fig. 4e, f) both have shading indicating residuals with overwhelmingly small magnitude. However, the one state indicated with a larger positive residual on both maps is Hawaii. Overall, these maps indicate that both dimension reduction strategies yield similar results for the models built for each of the three responses. Further improvements could be explored using models that, for example, include a spatial component.
Conclusion
Survey data almost always suffer from multicollinearity, so a researcher building a regression-type model from survey data will usually have to address it. Multicollinearity is not unique to survey data, but it is almost always present in them. Moreover, most survey datasets can be considered big data. Thus, there is a recognizable benefit to using dimension reduction techniques when building regression-type models with large survey datasets. Specifically, these techniques can help mitigate the problems caused by multicollinearity and reduce the dimensionality of the predictor variables under consideration.
We demonstrated the benefit of using dimension reduction procedures in the analysis of health insurance coverage data. In that analysis, SIR provided better estimates than the other dimension reduction techniques investigated, including PCA. The remaining techniques (partial SIR, SAVE, yPHD, and rPHD) did not reduce the dimensionality much for any of the three models we constructed. However, as with any statistical analysis that admits multiple approaches (e.g., different multiple comparisons procedures or different kernel methods), we advocate that the analyst consider each of the different dimension reduction procedures and then use various metrics and diagnostics to determine the best results. Once the results from a given dimension reduction procedure are used in the model of interest, which for our application was an additive model, standard criteria can be applied. In our analysis, we used the BIC and adjusted \(R^2\), both of which are well-established and accepted criteria for choosing between different models and assessing goodness of fit. Other diagnostic plots can be constructed, such as those based on the partial residuals of the estimated model. For our application, this strategy led us to determine that SIR provided the best fit. From the results, we were then able to model some of the regional differences in terms of healthcare coverage.
Overall, we believe that comparing the estimated models based on different dimension reduction procedures will assist the analyst with determining the best procedure to use for their particular data problem. However, there are a few limitations that should be emphasized. One practical limitation is the availability of software. In our experience, R provides the most extensive collection of dimension reduction procedures available, many of which are in the dr package [56], but not all software environments have packages devoted to the implementation of dimension reduction. Another limitation is in the utility of PCA. PCA reduces the number of predictors without considering the response variable(s), and choosing the number of principal components is not done through any formal inference paradigm. Thus, the number of principal components must be chosen through a rule of thumb, and the same principal components must be used when building different models for multiple responses, as was the case for our health insurance analysis. Finally, the only dimension reduction technique we presented that allows for binary or categorical variables is partial SIR. Since survey data tend to have a large number of such variables (e.g., socioeconomic indicators and demographic variables), partial SIR would be the only dimension reduction technique that can be directly applied to the data without requiring the analyst to modify the binary/categorical variables.
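To make the rule-of-thumb point concrete, the Python sketch below implements one common heuristic: retain the smallest number of principal components whose cumulative share of the total variance reaches a chosen threshold. The threshold of 0.9 and the eigenvalues are assumptions made for illustration; they are not the values used in our analysis, and other heuristics (such as inspecting a scree plot) are equally informal.

```python
def n_components_for_variance(eigenvalues, threshold=0.9):
    """Smallest number of components whose cumulative variance share
    reaches `threshold` (a heuristic, not a formal test)."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    cumulative = 0.0
    for count, eigenvalue in enumerate(ordered, start=1):
        cumulative += eigenvalue
        if cumulative / total >= threshold:
            return count
    return len(ordered)

# Illustrative eigenvalues of a covariance (or correlation) matrix
example_eigenvalues = [4.0, 2.0, 1.0, 0.5, 0.5]
n_keep_90 = n_components_for_variance(example_eigenvalues, threshold=0.9)
n_keep_75 = n_components_for_variance(example_eigenvalues, threshold=0.75)
```

Because the response never enters this calculation, the same retained components are reused for every response, which is the limitation noted above.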
Summary
Dimension reduction strategies, like PCA and sufficient dimension reduction, are being increasingly used in the era of big data. However, we believe that they are underutilized in the analysis of survey data from large databases, at least in terms of the published literature. We provided an overview of the more common dimension reduction techniques, followed by how those results can be used in flexible regression models. We then implemented that general strategy to analyze health insurance coverage data from the US Census Bureau’s 2015 PDB.
The quantity of big data will continue to increase over time and this is true for data collected from large surveys. We believe that dimension reduction techniques provide an efficacious strategy for the analysis of survey data. However, it is important to acknowledge some limitations with what we have discussed in this paper.
Principal component analysis is, of course, available in most statistical software and data analytics packages. However, there is currently a limited selection of software for performing sufficient dimension reduction techniques. Indeed, as we noted in the “Analysis” section, the sufficient dimension reduction techniques we employed were chosen because of their availability in R.
After performing dimension reduction, the resulting principal components or directional vectors help us understand features of the data that explain the most variability. However, the resulting transformed data have a more subjective interpretation. For example, in PCA, suppose demographic variables are the major contributors to the first principal component in an analysis. In this case, the analyst can attribute most of the variability in the data to demographics. But sometimes the first principal component is composed of a subset of seemingly unrelated variables, in which case there might not be a clear interpretation.
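One informal way to see which variables drive a component is to rank the variables by the absolute size of their loadings. The Python sketch below does this for a single loading vector. The variable names and loadings are hypothetical and serve only to illustrate the ranking; the cutoff `k` is an arbitrary choice.

```python
def top_contributors(loadings, names, k=3):
    """Return the k (name, loading) pairs with the largest absolute loadings."""
    ranked = sorted(zip(names, loadings), key=lambda pair: abs(pair[1]), reverse=True)
    return ranked[:k]

# Hypothetical first-component loadings for four predictors
variable_names = ["x1", "x2", "x3", "x4"]
first_pc_loadings = [0.1, -0.7, 0.6, 0.2]
leading_pairs = top_contributors(first_pc_loadings, variable_names, k=2)
```

If the leading variables share a theme (e.g., all are demographic measures), the component has a natural label; if they do not, the ranking makes the lack of a clear interpretation explicit.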
The number of centers stated is current as of September, 2017.
JW: Identified and prepared background material on the dimension reduction procedures discussed in the manuscript. Wrote all R scripts that were used for the analysis. Assembled summaries of all results. DSY: Identified and provided context for this applied problem. Responsible for interpreting results and identifying appropriate summaries. Responsible for preparation of manuscript. Both authors read and approved the final manuscript.
JW is a Ph.D. student in the Department of Statistics at the University of Kentucky. Her Ph.D. research focuses on novel sufficient dimension reduction methods. DSY is an Assistant Professor in the Department of Statistics at the University of Kentucky. His research interests include mixture modeling, tolerance regions, statistical computing, and applied survey data analysis. Prior to joining the faculty at the University of Kentucky, he spent 3.5 years as a Senior Statistician working on data problems for the Naval Nuclear Propulsion Program and 3 years as a Research Mathematical Statistician at the US Census Bureau working on big data problems, some of which utilized older versions of the Planning Database. DSY is also an Accredited Professional Statistician™ of the American Statistical Association.
We would like to thank Professor Xiangrong Yin of the University of Kentucky for many helpful comments on an earlier draft of this manuscript. We would also like to thank five anonymous reviewers who provided a number of important comments that helped improve the overall quality of this manuscript.
The authors declare that they have no competing interests.
The 2015 PDB is a publicly available Census Bureau dataset located at http://goo.gl/LlcwY7. All R code used to analyze the data is available as Additional files 1, 2.
JW was supported as a Research Assistant by NSF Grant SES-1562503 throughout the duration of this research. The funding body did not have any role in the design of the study or the collection, analysis, and interpretation of data.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.