Abstract
In this paper, we study dimension folding for matrix/array-structured predictors with categorical variables. The categorical variable information is incorporated into dimension folding for regression and classification. The concepts of marginal, conditional, and partial folding subspaces are introduced, and their connections to the central folding subspace are investigated. Three estimation methods are proposed to estimate the desired partial folding subspace. An empirical maximal eigenvalue ratio criterion is used to determine the structural dimensions of the associated partial folding subspace. The effectiveness of the proposed methods is evaluated through simulation studies and an application to a longitudinal data set.
References
F. Chiaromonte, R.D. Cook, B. Li, Sufficient dimension reduction in regressions with categorical predictors. Ann. Stat. 30, 475–497 (2002)
R.D. Cook, On the interpretation of regression plots. J. Am. Stat. Assoc. 89, 177–189 (1994)
R.D. Cook, Graphics for regressions with a binary response. J. Am. Stat. Assoc. 91, 983–992 (1996)
R.D. Cook, Regression Graphics: Ideas for Studying Regressions Through Graphics (Wiley, New York, 1998)
R.D. Cook, Testing predictor contribution in sufficient dimension reduction. Ann. Stat. 32, 1062–1092 (2004)
R.D. Cook, S. Weisberg, Discussion of “Sliced inverse regression for dimension reduction”. J. Am. Stat. Assoc. 86, 328–332 (1991)
S. Ding, R.D. Cook, Dimension folding PCA and PFC for matrix-valued predictors. Stat. Sin. 24, 463–492 (2014)
S. Ding, R.D. Cook, Tensor sliced inverse regression. J. Multivar. Anal. 133, 216–231 (2015)
T.R. Fleming, D.P. Harrington, Counting Processes and Survival Analysis (Wiley, New York, 1991)
IBM Big Data and Analytics Hub. The Four V’s of Big Data (2014). http://www.ibmbigdatahub.com/infographic/four-vs-big-data
K.-C. Li, Sliced inverse regression for dimension reduction (with discussion). J. Am. Stat. Assoc. 86, 316–342 (1991)
B. Li, S. Wang, On directional regression for dimension reduction. J. Am. Stat. Assoc. 102, 997–1008 (2007)
L. Li, X. Yin, Longitudinal data analysis using sufficient dimension reduction. Comput. Stat. Data Anal. 53, 4106–4115 (2009)
B. Li, R.D. Cook, F. Chiaromonte, Dimension reduction for the conditional mean in regressions with categorical predictors. Ann. Stat. 31, 1636–1668 (2003)
B. Li, H. Zha, F. Chiaromonte, Contour regression: a general approach to dimension reduction. Ann. Stat. 33, 1580–1616 (2005)
B. Li, S. Wen, L. Zhu, On a projective resampling method for dimension reduction with multivariate responses. J. Am. Stat. Assoc. 103, 1177–1186 (2008)
B. Li, M. Kim, N. Altman, On dimension folding of matrix- or array-valued statistical objects. Ann. Stat. 38, 1094–1121 (2010)
W. Luo, B. Li, Combining eigenvalues and variation of eigenvectors for order determination. Biometrika 103, 875–887 (2016)
R. Luo, H. Wang, C.L. Tsai, Contour projected dimension reduction. Ann. Stat. 37, 3743–3778 (2009)
J.R. Magnus, H. Neudecker, Matrix Differential Calculus with Applications in Statistics and Econometrics, 2nd edn. (Wiley, New York, 1999)
P.A. Murtaugh, E.R. Dickson, G.M. Van Dam, M. Malinchoc, P.M. Grambsch, A.L. Langworthy, C.H. Gips, Primary biliary cirrhosis: prediction of short-term survival based on repeated patient visits. Hepatology 20, 126–134 (1994)
Y. Pan, Q. Mai, X. Zhang, Covariate-adjusted tensor classification in high dimensions. J. Am. Stat. Assoc. 114, 1305–1319 (2019)
R.M. Pfeiffer, L. Forzani, E. Bura, Sufficient dimension reduction for longitudinally measured predictors. Stat. Med. 31, 2414–2427 (2012)
J.A. Talwalkar, K.D. Lindor, Primary biliary cirrhosis. Lancet 362, 53–61 (2003)
Y. Xia, H. Tong, W. Li, L. Zhu, An adaptive estimation of dimension reduction space. J. R. Stat. Soc. Ser. B 64, 363–410 (2002)
Y. Xue, X. Yin, Sufficient dimension folding for regression mean function. J. Comput. Graph. Stat. 23, 1028–1043 (2014)
Y. Xue, X. Yin, Sufficient dimension folding for a functional of conditional distribution of matrix- or array-valued objects. J. Nonparametr. Stat. 27, 253–269 (2015)
Y. Xue, X. Yin, X. Jiang, Ensemble sufficient dimension folding methods for analyzing matrix-valued data. Comput. Stat. Data Anal. 103, 193–205 (2016)
Z. Ye, R.E. Weiss, Using the bootstrap to select one of a new class of dimension reduction methods. J. Am. Stat. Assoc. 98, 968–979 (2003)
Y. Zhu, P. Zeng, Fourier methods for estimating the central subspace and the central mean subspace in regression. J. Am. Stat. Assoc. 101, 1638–1651 (2006)
Acknowledgements
Yin’s work is supported in part by NSF grant CIF-1813330. Xue’s work is supported in part by the Fundamental Research Funds for the Central Universities in the University of International Business and Economics (CXTD11-05).
8 Appendix
8.1 Proofs
The following equivalent relationship will be used repeatedly in the proof of Proposition 1. For generic random variables \(V_1\), \(V_2\), \(V_3\), and \(V_4\), Cook (1998) showed that

\[ V_1 \perp\!\!\!\perp (V_2, V_3) \mid V_4 \quad \Longleftrightarrow \quad V_1 \perp\!\!\!\perp V_2 \mid (V_3, V_4) \ \text{ and } \ V_1 \perp\!\!\!\perp V_3 \mid V_4. \qquad (8.1) \]
Proof of Proposition 1 part (a)
In Eq. (8.1), let
and apply the first part of Eq. (8.1) together with the equivalent relationship stated above; we then have
Therefore, under the assumption that
we have \( S_{Y|\circ {\mathbf X}} \subseteq S_{Y|\circ {\mathbf X} }^{(W)} \), \( S_{Y|{\mathbf X} \circ } \subseteq S_{Y|{\mathbf X} \circ }^{(W)} \), and \(S_{Y|\circ {\mathbf X} \circ } \subseteq S_{Y|\circ {\mathbf X} \circ }^{(W)} \). Now in Eq. (8.1), let
and again apply the first part of Eq. (8.1) together with the equivalent relationship stated above; we then have
Therefore, under the assumption that
we have \( S_{Y|\circ {\mathbf X}} \subseteq S_{Y|\circ {\mathbf X} }^{(W)}\), \( S_{Y|{\mathbf X} \circ } \subseteq S_{Y|{\mathbf X} \circ }^{(W)} \), and \( S_{Y|\circ {\mathbf X} \circ } \subseteq S_{Y|\circ {\mathbf X} \circ }^{(W)} \). □
Proof of Proposition 1 part (b)
In Eq. (8.1), let
and apply the first part of Eq. (8.1) together with the equivalent relationship stated above; we then have
Therefore, under the given assumption, we also have the corresponding conditional independence. Thus,
and further we have \(S_{Y|\circ {\mathbf X} }^{(W)} \subseteq S_{Y|\circ {\mathbf X}} \), \( S_{Y|{\mathbf X} \circ }^{(W)} \subseteq S_{Y|{\mathbf X} \circ } \) and \( S_{Y|\circ {\mathbf X} \circ }^{(W)} \subseteq S_{Y|\circ {\mathbf X} \circ } \). □
Proof of Proposition 1 part (c)
For generic subspaces \(S_L\) and \(S_R\), we have
Since \(S_{Y|\circ {\mathbf X} \circ }^{(W)} = S_{Y|{\mathbf X} \circ }^{(W)} \otimes S_{Y|\circ {\mathbf X} }^{(W)} \), and since \(S_{Y|\circ {\mathbf X} }^{(W)}\) and \(S_{Y|{\mathbf X} \circ }^{(W)}\) satisfy the left-hand side of Eq. (8.2) by their definitions, they also satisfy
This implies that, for all w = 1, …, C, \(S_{Y_w|\circ {\mathbf X}_w} \subseteq S_{Y|\circ {\mathbf X}}^{(W)}\) and \( S_{Y_w|{\mathbf X}_w \circ } \subseteq S_{Y|{\mathbf X} \circ }^{(W)} \), and thus \(\oplus _{w=1}^{C} S_{Y_w|\circ {\mathbf X}_w} \subseteq S_{Y|\circ {\mathbf X}}^{(W)} \) and \( \oplus _{w=1}^{C} S_{Y_w| {\mathbf X}_w \circ } \subseteq S_{Y|{\mathbf X} \circ }^{(W)} \). Therefore,
Because \( S_{Y_w| \circ {\mathbf X}_w } \subseteq (\oplus _{w=1}^{C} S_{Y_w|\circ {\mathbf X}_w}) \) and \(S_{Y_w| {\mathbf X}_w \circ } \subseteq (\oplus _{w=1}^{C} S_{Y_w| {\mathbf X}_w \circ })\) for all w = 1, …, C, the two direct-sum spaces also satisfy the right-hand side of Eq. (8.2). Therefore, we have
This implies the reverse containment
We then conclude that \( S_{Y|\circ {\mathbf X} \circ }^{(W)} = (\oplus _{w=1}^{C} S_{Y_w| {\mathbf X}_w \circ }) \otimes (\oplus _{w=1}^{C} S_{Y_w|\circ {\mathbf X}_w}) \). □
Proof of Proposition 1 part (d)
For generic subspaces \(S_L\) and \(S_R\), we have
Since \(S_{Y|\circ {\mathbf X} \circ }^{(W)} = S_{Y|{\mathbf X} \circ }^{(W)} \otimes S_{Y|\circ {\mathbf X} }^{(W)} \), and since \(S_{Y|\circ {\mathbf X} }^{(W)}\) and \(S_{Y|{\mathbf X} \circ }^{(W)}\) satisfy the left-hand side of Eq. (8.3) by their definitions, they also satisfy
This implies that, for all w = 1, …, C, \( S_{Y_w|\circ {\mathbf X}_w} \subseteq S_{Y|\circ {\mathbf X}}^{(W)} \) and \(S_{Y_w|{\mathbf X}_w \circ } \subseteq S_{Y|{\mathbf X} \circ }^{(W)}\), and thus \(S_{Y_w|{\mathbf X}_w \circ } \otimes S_{Y_w|\circ {\mathbf X}_w}\subseteq S_{Y| {\mathbf X} \circ }^{(W)} \otimes S_{Y|\circ {\mathbf X}}^{(W)} = S_{Y|\circ {\mathbf X} \circ }^{(W)}\). Therefore,
Because \( S_{Y_w| {\mathbf X}_w \circ } \otimes S_{Y_w|\circ {\mathbf X}_w} \subseteq \oplus _{w=1}^{C} (S_{Y_w| {\mathbf X}_w \circ } \otimes S_{Y_w|\circ {\mathbf X}_w} )\) and \(S_{Y_w| {\mathbf X}_w \circ } \otimes S_{Y_w|\circ {\mathbf X}_w} \) satisfies the second relation on the right-hand side of Eq. (8.3) for all w = 1, …, C, we have
where \({\mathbf U}^*\) is a random basis matrix of the space \(\oplus _{w=1}^C span(\beta _w \otimes \alpha _w)\) in \(\mathbb R^{p_l p_r \times k}\). Therefore, by the definition of the Kronecker envelope in Li et al. (2010), the Kronecker envelope of \({\mathbf U}^*\) with respect to the integers \(p_l\) and \(p_r\), that is, \(\epsilon ^{\otimes }_{p_l, p_r} ({\mathbf U}^*)= S_{{\mathbf U}^* \circ } \otimes S_{\circ {\mathbf U}^*} \), satisfies the following conditions:

1. \(span({\mathbf U}^*) = \oplus _{w=1}^{C} (S_{Y_w| {\mathbf X}_w \circ } \otimes S_{Y_w|\circ {\mathbf X}_w} ) \subseteq S_{{\mathbf U}^* \circ } \otimes S_{\circ {\mathbf U}^*}\) almost surely.
2. If another pair of subspaces \(S_R \subseteq \mathbb R^{p_r}\) and \(S_L \subseteq \mathbb R^{p_l}\) satisfies condition 1, then \(S_{{\mathbf U}^* \circ } \otimes S_{\circ {\mathbf U}^*} \subseteq S_R \otimes S_L\).

However, from the previous proof,
and by definition, \(S_{Y|{\mathbf X} \circ }^{(W)} \subseteq \mathbb R^{p_r}\) and \(S_{Y|\circ {\mathbf X}}^{(W)} \subseteq \mathbb R^{p_l}\). Therefore,
On the other hand, for all w = 1, …, C,
Therefore, \(S_{\circ {\mathbf U}^*}\) and \(S_{{\mathbf U}^* \circ }\) satisfy the second relation on the right-hand side of Eq. (8.3). For the left-hand side of Eq. (8.3), we have
Thus \( S_{Y|\circ {\mathbf X} }^{(W)} \subseteq S_{\circ {\mathbf U}^*} \) and \( S_{Y|{\mathbf X} \circ }^{(W)} \subseteq S_{{\mathbf U}^* \circ } \), which implies the relationship
Therefore,
This shows that \(S_{Y|\circ {\mathbf X} \circ }^{(W)}\) equals the Kronecker envelope of \(\oplus _{w=1}^{C} (S_{Y_w| {\mathbf X}_w \circ } \otimes S_{Y_w|\circ {\mathbf X}_w})\). Thus, by estimating \(\oplus _{w=1}^{C} (S_{Y_w| {\mathbf X}_w \circ } \otimes S_{Y_w|\circ {\mathbf X}_w})\), we target a subspace of \(S_{Y|\circ {\mathbf X} \circ }^{(W)}\) that may be proper; in particular, estimating \(\oplus _{w=1}^{C} (S_{Y_w| {\mathbf X}_w \circ } \otimes S_{Y_w|\circ {\mathbf X}_w})\) need not recover \(S_{Y|\circ {\mathbf X} \circ }^{(W)}\) exhaustively. □
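To make the Kronecker envelope used above concrete, the following is a minimal numerical sketch (ours, not part of the paper) of \(\epsilon ^{\otimes }_{p_l, p_r}\) for a fixed basis matrix \({\mathbf U}\). It rests on the identity \(vec(\alpha C \beta ^{\top }) = (\beta \otimes \alpha ) vec(C)\): a column \(u = vec(M)\) lies in \(span(\beta \otimes \alpha )\) exactly when the columns of \(M\) lie in \(span(\alpha )\) and the rows of \(M\) lie in \(span(\beta )\), so the envelope's factors are the sums of the column spaces and of the row spaces of the refolded columns. The function name and tolerance are our own; the almost-sure version for random \({\mathbf U}^*\) in Li et al. (2010) is analogous.

import numpy as np

def kronecker_envelope(U, p_l, p_r, tol=1e-8):
    # Refold each column u = vec(M) into the p_l x p_r matrix M
    # (column-major vec, matching vec(alpha C beta') = (beta kron alpha) vec(C)).
    mats = [U[:, j].reshape((p_l, p_r), order="F") for j in range(U.shape[1])]
    left_stack = np.hstack(mats)                  # its column space is S_L
    right_stack = np.hstack([M.T for M in mats])  # its column space is S_R

    def orth(A):
        # Orthonormal basis of col(A) via SVD, dropping near-zero directions.
        q, s, _ = np.linalg.svd(A)
        return q[:, : int((s > tol * s.max()).sum())]

    return orth(left_stack), orth(right_stack)    # bases of S_L and S_R

# Toy check: columns of U drawn from a genuine Kronecker-structured space.
rng = np.random.default_rng(0)
p_l, p_r = 3, 4
alpha = rng.standard_normal((p_l, 2))
beta = rng.standard_normal((p_r, 2))
U = np.kron(beta, alpha) @ rng.standard_normal((4, 3))
L, R = kronecker_envelope(U, p_l, p_r)
print(L.shape[1], R.shape[1])  # generically 2 and 2; span(U) lies in span(np.kron(R, L))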
Proof of Proposition 1 part (e)
First note that if for each W = w, \(span({\mathbf U}_w) \subseteq S_{Y_w | vec({\mathbf X})_w}\) almost surely, then from part (d) of Proposition 1, we have
almost surely. Therefore, by the definition of Kronecker product, we have
□
Proof of Theorem 1
Using the double expectation formula, we can write the objective function as
where the inside expectation is with respect to the random matrices \({\mathbf U}_1, \ldots , {\mathbf U}_C\) and the outside expectation is with respect to the categorical variable W. This is equivalent to
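For concreteness, a plausible explicit form of this decomposition is the following sketch; it assumes (our assumption, not the paper's exact display) that the objective in (4.3) is the expected squared Frobenius distance between \(A{\mathbf U}_W\) and \(A(\beta \otimes \alpha ) f_W(Z)\), as suggested by the identity \(A{\mathbf U}_w = A(\beta _0 \otimes \alpha _0)\phi _w(Z)\) used below:

\[ \mathrm{E} \big\| A{\mathbf U}_W - A(\beta \otimes \alpha ) f_W(Z) \big\|_F^2 \;=\; \sum_{w=1}^{C} P(W=w) \, \mathrm{E}\Big[ \big\| A{\mathbf U}_w - A(\beta \otimes \alpha ) f_w(Z) \big\|_F^2 \,\Big|\, W=w \Big], \]

so that minimizing the objective amounts to minimizing each conditional least-squares term, weighted by \(P(W=w)\).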
Assume \(\epsilon ^{\otimes }({\mathbf U}^*) = span(\beta _0 \otimes \alpha _0)\). Because, for each W = w, \( span({\mathbf U}_w) \subseteq \oplus _{w=1}^C span({\mathbf U}_w)\subseteq {\epsilon }^{\otimes } ({\mathbf U}^*) = span(\beta _0 \otimes \alpha _0) \) and the elements of \({\mathbf U}_w\) are measurable with respect to Z, there exists a random projection matrix \(\phi _w(Z) \in L^{ d_l d_r \times k_w}\) such that \({\mathbf U}_w = (\beta _0 \otimes \alpha _0)\phi _w(Z)\), which is equivalent to \(A {\mathbf U}_w = A(\beta _0 \otimes \alpha _0)\phi _w(Z)\).
Thus (4.3), or equivalently (8.4), reaches its minimum of 0 within the range of \((\alpha , \beta , f_1,\ldots , f_C)\) given in the theorem. This implies that any minimizer \((\alpha ^*, \beta ^*, f_1^*,\ldots , f_C^*)\) of (4.3) must satisfy \(A(\beta ^* \otimes \alpha ^* ) f_w^* (Z) = A{\mathbf U}_w\) almost surely for every W = w and, consequently, \((\beta _0 \otimes \alpha _0)\phi _w(Z) = (\beta ^* \otimes \alpha ^*) f_w^*(Z)\) almost surely. But this means that \(span(\beta ^* \otimes \alpha ^*)\) contains each \({\mathbf U}_w\) almost surely; thus we have \(\oplus _{w=1}^C span({\mathbf U}_w) \subseteq span(\beta ^* \otimes \alpha ^*)\). Since \(span(\beta ^* \otimes \alpha ^*)\) has the same dimension as \(\epsilon ^{\otimes }({\mathbf U}^*)\), the theorem now follows from the uniqueness of the Kronecker envelope. □
8.2 Additional Simulation and Data Analysis
The following six examples parallel the simulation studies in Sect. 6.1, showing how the results change as the overlap among the individual subspaces changes.
Example A1
Example A1 keeps almost the same experimental setting as Example 1 but slightly changes the conditional distribution of Y given X and W, so that the two conditional central subspaces overlap but are not identical.
In this example,
The two conditional folding subspaces overlap, because their left conditional folding subspaces are identical and their right conditional folding subspaces share one common direction. By part (c) of Proposition 1, we have:
On the other hand, based on part (d) of Proposition 1, we have:
Therefore, all three methods can still recover \(S_{Y|{\circ }X{\circ }}^{(W)}\) exhaustively. Again, for vectorized data,
thus
Table 7 summarizes the simulation results for Example A1. Note that all three methods perform worse than in Example 1 in terms of accuracy and variability, because they estimate a larger partial folding subspace than in Example 1. The individual ensemble method and the LSFA method still provide similar accuracy, and both outperform the objective function method.
Example A2
Example A2 keeps the two conditional central subspaces overlapping, but to a smaller extent. This is achieved by setting the conditional distribution as
In this example,
The two conditional folding subspaces overlap slightly, but neither their left nor their right conditional folding subspaces coincide. By part (c) of Proposition 1, we have:
On the other hand, based on part (d) of Proposition 1, we have:
In this case, only the individual ensemble method and the objective function method recover \(S_{Y|{\circ }X{\circ }}^{(W)}\) exhaustively. Since the LSFA method targets the space \(S_{Y|{\circ }X_{w=0}{\circ }} \oplus S_{Y|{\circ }X_{w=1}{\circ }}\), which is a smaller subspace by part (d) of Proposition 1 and the experimental setting above, the LSFA method estimates a smaller subspace than the desired partial folding subspace \(S_{Y|{\circ }X{\circ }}^{(W)} \). In practice, the accuracy of the LSFA method may not suffer, since we use the results from the individual ensemble method as initial values. Again, for vectorized data,
thus
Table 8 displays the simulation results for Example A2, where the individual ensemble method and the LSFA method still outperform the objective function method.
Example A3
Example A3 constrains the two conditional central subspaces to be the same as the conditional folding subspaces, i.e., for w = 0, 1,
However, the partial central subspace is still a proper subspace of the partial folding subspace. We achieve this by constraining the two partial central subspaces to be orthogonal to each other. The conditional distribution of Y given X and W is:
In this case,
Thus,
Based on part (c) of Proposition 1, we have:
Still, the LSFA method only targets \(S_{Y|{\circ }X_{w=0}{\circ }} \oplus S_{Y|{\circ }X_{w=1}{\circ }}\), which is a proper subspace of the desired space \(S_{Y|{\circ }X{\circ }}^{(W)}\).
The simulation results for this example in Table 9 indicate that the individual ensemble method and the LSFA method perform similarly.
Example A4
In Example A4, we constrain the conditional central subspaces and the partial central subspace to be the same as the conditional folding subspaces and the partial folding subspace, respectively. Since estimating the partial folding subspace greatly reduces the number of parameters, especially when the dimensions are large, in this example we are specifically interested in whether folding-based methods can achieve higher accuracy than traditional methods such as partial SIR (a rough parameter count is sketched below, after the discussion of Table 10). We modify the conditional distribution as:
In this case,
And most importantly,
Again, the results in Table 10 show that the individual ensemble and LSFA methods perform better than the other two approaches.
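To make Example A4's motivation concrete, here is a rough parameter count (our illustration, ignoring identifiability constraints on the basis matrices). Folding-based methods estimate the pair \((\alpha , \beta )\) with \(p_l d_l + p_r d_r\) entries, whereas vectorized methods such as partial SIR estimate an unstructured \(p_l p_r \times d\) basis for \(vec({\mathbf X})\):

\[ \underbrace{p_l d_l + p_r d_r}_{\text{folding}} \quad \text{versus} \quad \underbrace{p_l p_r d}_{\text{vectorized}}; \qquad \text{e.g., } p_l = p_r = 10, \ d_l = d_r = d = 1 \ \text{gives } 20 \text{ versus } 100. \]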
Figure 3 summarizes the first two examples in the paper and Examples A1–A4. The three estimation methods can be interpreted as follows. Since the partial folding subspace \(S_{Y|{\circ }X{\circ }}^{(W)}\) has a basis matrix that must be expressed as a Kronecker product, it can only be covered by a “rectangle space.” Exhaustive methods, including the individual ensemble method and the objective function method, therefore attempt to find one minimal “rectangle space” that covers both conditional folding subspaces. The LSFA method instead estimates \(\oplus S_{Y|{\circ }X_{w}{\circ }}\), looking for two minimal “rectangles” that together cover all the conditional folding subspaces, which can therefore be smaller than the partial folding subspace. The traditional partial central subspace \(S_{Y| vec({\mathbf X}) } ^{(W)}\), which stacks the columns together, and its estimation method partial SIR look for “blocks” that cover all the conditional central subspaces.
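A toy illustration of the “rectangle” picture (ours, not one of the paper's examples): let \(p_l = p_r = 2\) and suppose the two conditional folding subspaces are \(S_0 = span(e_1) \otimes span(e_1)\) and \(S_1 = span(e_2) \otimes span(e_2)\). Then

\[ \dim (S_0 \oplus S_1) = 2, \qquad \text{while the smallest covering rectangle is } \ span(e_1, e_2) \otimes span(e_1, e_2) = \mathbb R^4. \]

In this toy case, LSFA targets the two-dimensional direct sum, whereas the partial folding subspace, being a single Kronecker product containing that direct sum, must be the full four-dimensional rectangle.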
Example A5
Example A5 follows closely from Example 3 and is intended to construct partial folding subspaces exactly the same as those of Examples 2 and A1. The details of the experimental setting are as follows. For W = 0, it follows exactly the same setting as Example 3. For W = 1, however, the conditional mean of X given Y is changed to:
Correspondingly, the conditional covariance structure stays the same as in Example 3 except for the index set A = {(1, 3), (2, 2)}. We can easily verify that the desired partial folding subspace \(S_{Y|{\circ }X{\circ }}^{(W)}\) is the same as in Example A1. But for the vectorized data vec(X),
and
thus
From Table 11, it appears that the objective function method with pooled variance provides the smallest errors and the smallest variability across all sample sizes n. The individual direction ensemble method and the LSFA method produce similar accuracy and stability in Example A5.
Example A6
Example A6 also follows closely from Example 3, intending to construct partial folding subspaces exactly the same as those of Example 2 and Example A1. In this example, the two conditional folding subspaces are less overlapped, leading to a larger partial folding subspace. For W = 0, it follows exactly the same setting as Examples 3 and A5. For W = 1, however, the conditional mean of X given the response Y is changed to:
Correspondingly, the conditional covariance structure stays the same as in Example 3 except for the index set A = {(2, 3), (3, 2)}. We can easily verify that the desired partial folding subspace \(S_{Y|{\circ }X{\circ }}^{(W)}\) is the same as in Example A1. But for the vectorized data vec(X),
and
thus
The results are listed in Table 12; as before, the proposed individual ensemble method and LSFA method outperform the objective function optimization method, although the objective function optimization method with pooled covariance yields the smallest errors and standard deviations.
Three Histograms for the Real Data
See Fig. 4.
The Bootstrap Confidence Interval Plots for Real Data
See Fig. 5.
Copyright information
© 2021 Springer Nature Switzerland AG
About this chapter
Cite this chapter
Wang, Y., Xue, Y., Yuan, Q., Yin, X. (2021). Sufficient Dimension Folding with Categorical Predictors. In: Bura, E., Li, B. (eds) Festschrift in Honor of R. Dennis Cook. Springer, Cham. https://doi.org/10.1007/978-3-030-69009-0_7
Print ISBN: 978-3-030-69008-3
Online ISBN: 978-3-030-69009-0