Likelihood-based surrogate dimension reduction

We consider the problem of surrogate sufficient dimension reduction, that is, estimating the central subspace of a regression model when the covariates are contaminated by measurement error. When no measurement error is present, a likelihood-based dimension reduction method, which maximizes the likelihood of a Gaussian inverse regression model over the Grassmann manifold, is well known to outperform traditional inverse moment methods. We propose two likelihood-based estimators of the central subspace in measurement error settings, which make different adjustments to the observed surrogates. Both estimators are computed by maximizing objective functions over the Grassmann manifold and are shown to consistently recover the true central subspace. When the central subspace is assumed to depend on only a few covariates, we further propose augmenting the likelihood function with a penalty term that induces sparsity on the Grassmann manifold, yielding sparse estimators. The resulting objective function has a closed-form Riemannian gradient, which facilitates efficient computation of the penalized estimator, and we leverage a state-of-the-art trust-region algorithm on the Grassmann manifold to compute all the proposed estimators efficiently. Simulation studies and a data application demonstrate that the proposed likelihood-based estimators perform better than inverse moment-based estimators in terms of both estimation and variable selection accuracy.


Introduction
In a regression setting with an outcome y ∈ R and a covariate vector X ∈ R^p, sufficient dimension reduction (SDR) refers to a class of methods that express the outcome through a few linear combinations of X (Li, 2018). In other words, SDR aims to estimate a matrix B ∈ R^{p×d} such that y ⊥ X | B^T X, where d ≪ p. As the matrix B is generally not unique, the estimation target in SDR is the central subspace S_{y|X}, defined as the intersection of all subspaces spanned by the columns of matrices B satisfying the above conditional independence condition. This central subspace is unique under mild conditions (Glaws et al., 2020), and is characterized by the projection matrix associated with B.
This article focuses on the problem of estimating the central subspace S_{y|X} when the covariates X are measured with error. This problem is known in the literature as surrogate sufficient dimension reduction (Li and Yin, 2007). Instead of observing a sample of X, we observe a sample of W, a vector of surrogates related to X by the classical additive measurement error model W = X + U, where U is a vector of measurement errors independent of X that follows a p-variate Gaussian distribution with mean zero and covariance matrix Σ_u. While more general measurement error models have also been proposed in the literature, e.g., W = AX + U for A ∈ R^{r×p} with r ≥ p (Carroll and Li, 1992; Li and Yin, 2007; Zhang et al., 2014), the classical measurement error model with A = I_p remains the most widely used in practice (see Grace et al., 2021; Chen and Yi, 2022; Chen, 2023, for some recent examples), and is the focus of this paper. Throughout the article we assume Σ_u is known; in practice, this covariance matrix may be estimated from auxiliary data, such as replicate observations, before subsequent analyses are conducted (Carroll et al., 2006). Our observed data thus consist of n pairs (y_i, W_i^T) with W_i = X_i + U_i, i = 1, . . ., n.
When the true covariates X_i are observed, there exists a vast literature on how to estimate the central subspace; see Li (2018) for an overview. These include traditional inverse moment-based methods such as sliced inverse regression (Li, 1991) and sliced average variance estimation (Cook, 2000), forward regression methods such as minimum average variance estimation (Xia et al., 2002) and directional regression (Li and Wang, 2007), and inverse regression methods such as principal fitted components (Cook and Forzani, 2008) and likelihood-acquired directions (LAD; Cook and Forzani, 2009). Recent literature has also extended these methods to more complex settings, such as high-dimensional data (e.g., Lin et al., 2021; Qian et al., 2019) and longitudinal data (e.g., Hui and Nghiem, 2022). Among the estimators discussed above, the LAD method developed by Cook and Forzani (2009) has a unique advantage in that it is constructed from a well-defined Gaussian likelihood function and, as such, inherits the optimality properties of likelihood theory. Moreover, while the conditional Gaussianity assumption may appear unnatural, Cook and Forzani (2009) showed that the LAD estimator retains superior performance even when Gaussianity does not hold; roughly speaking, the conditional normality plays the role of a "working model." In this article, we focus on adapting LAD to the surrogate SDR problem.
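To make the setup concrete, the following sketch (not the authors' code; variable names are illustrative) simulates the classical additive model W = X + U and estimates Σ_u from replicate surrogates, using the standard identity Var(W_{i1} − W_{i2}) = 2Σ_u:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 2000, 5

# True covariates and a known measurement error covariance
Sigma_u = np.diag(rng.uniform(0.2, 0.5, size=p))
X = rng.normal(size=(n, p))

# Two replicate surrogates per subject: W_ij = X_i + U_ij
W1 = X + rng.multivariate_normal(np.zeros(p), Sigma_u, size=n)
W2 = X + rng.multivariate_normal(np.zeros(p), Sigma_u, size=n)

# Replicate-based estimator of Sigma_u:
# Var(W1 - W2) = 2 Sigma_u, so Sigma_u_hat = Cov(W1 - W2) / 2
D = W1 - W2
Sigma_u_hat = (D.T @ D) / (2 * n)

print(np.round(np.diag(Sigma_u_hat), 2))
```

With n = 2000 replicate pairs, the diagonal of `Sigma_u_hat` recovers the true error variances to within sampling error.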
The problem of surrogate SDR was first considered in Carroll and Li (1992), who showed that S_{y|X} can be estimated by performing ordinary least squares, or sliced inverse regression, of the response on the adjusted surrogate X* = RW, where R = Σ_{xw} Σ_w^{-1}. Here, Σ_{xw} = Cov(X, W) denotes the covariance matrix between X and W, and Σ_w = Var(W) denotes the variance-covariance matrix of W. When Σ_u is known, these adjusted surrogates can be computed by replacing Σ_w with an appropriate estimator from the sample. This underlying idea was expanded into a broader invariance law by Li and Yin (2007), who proved that if both X and U follow multivariate Gaussian distributions, then S_{y|X*} = S_{y|X}. As a result, under the assumption of Gaussianity of both X and U, any consistent SDR method applied to y and the adjusted surrogate X* is also consistent. If X is non-Gaussian, this relationship is maintained if B^T X is approximately Gaussian; for instance, this is achieved when p → ∞ and each column of B is dense, i.e., the majority of elements in each column are non-zero. On the other hand, when the number of covariates is large, it is often assumed that the central subspace is sparse, i.e., that it depends only on a few covariates (Lin et al., 2019), and so this Gaussian approximation may not be realistic or relevant in practice. Elsewhere, Zhang et al. (2012) examined surrogate SDR when both the response and the covariates are distorted by multiplicative measurement errors.
The rest of the paper is organized as follows. In Section 2, we briefly review the LAD estimator of the central subspace, which is built upon a Gaussian inverse regression model; we then incorporate measurement errors and discuss the properties of the corresponding model. We propose maximum likelihood estimators of the central subspace from the surrogate data in Section 3. Section 4 presents simulation studies to demonstrate the performance of the proposed estimators, while Section 5 illustrates the methodology in a data application. Finally, Section 6 contains some concluding remarks.
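A minimal sketch of the Carroll and Li (1992) adjustment, assuming the classical model so that Σ_x = Σ_w − Σ_u with Σ_u known; the function name and check are our own:

```python
import numpy as np

def adjusted_surrogates(W, Sigma_u):
    """Compute X* = R W with R = Sigma_x Sigma_w^{-1}, where
    Sigma_x = Sigma_w - Sigma_u under the classical additive model."""
    n, p = W.shape
    Wc = W - W.mean(axis=0)
    Sigma_w = (Wc.T @ Wc) / n          # sample covariance of W
    Sigma_x = Sigma_w - Sigma_u        # implied covariance of X
    R = Sigma_x @ np.linalg.inv(Sigma_w)
    return W @ R.T                     # each row is R @ W_i

# sanity check: with no measurement error, R = I and X* equals W
rng = np.random.default_rng(1)
W = rng.normal(size=(500, 4))
Xstar = adjusted_surrogates(W, np.zeros((4, 4)))
print(np.allclose(Xstar, W))
```

Any consistent SDR method could then be applied to (y, X*) under the invariance law.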

A Gaussian inverse regression model with measurement errors
We first review the LAD method proposed by Cook and Forzani (2009) when the true covariate vector X is observed. LAD models the inverse predictor X | y as having a multivariate Gaussian distribution with mean μ_y and covariance ∆_y, i.e., X | y ∼ N(μ_y, ∆_y). Importantly, the existence of the central subspace S_{y|X} imposes some constraints on the parameters μ_y and ∆_y. Let μ = E(X), ∆ = E(∆_y) = E{Var(X | y)}, let Ψ ∈ R^{p×d} be a semi-orthogonal basis matrix for S_{y|X}, and let Ψ_0 ∈ R^{p×(p−d)} be a matrix such that (Ψ, Ψ_0) ∈ R^{p×p} is an orthogonal matrix. Then S_{y|X} is a central subspace if and only if the following two conditions, referred to jointly as (1), are satisfied: (i) the conditional distribution of Ψ^T X | y may depend on y, and (ii) the conditional distribution of Ψ_0^T X | (Ψ^T X, y) does not depend on y. This aligns with the intuition that Ψ^T X is a sufficient predictor for y. Cook and Forzani (2009) also provide several equivalent characterisations of the two conditions in (1), for example, Ψ_0^T ∆_y^{-1} = Ψ_0^T ∆^{-1} and μ_y − μ = ∆Ψν_y. The advantage of the conditions in (1) is that a log-likelihood function can be constructed and maximized to consistently estimate all the parameters of the LAD model. As reviewed in Section 1, Cook and Forzani (2009) highlighted that Gaussianity of the inverse predictor X | y is not essential for LAD. Indeed, their simulation results show that the LAD estimator has superior performance to other SDR methods in terms of recovering the central subspace, even when the Gaussianity assumption is not satisfied.
When measurement errors following the classical additive model are present in the covariates, so that X is replaced by W, condition (ii) in (1) is no longer satisfied. Specifically, it can be shown that the conditional mean of Ψ_0^T W | (Ψ^T W, y) generally depends on y, and so Ψ is no longer guaranteed to provide a sufficient dimension reduction spanning the central subspace S_{y|W}. By a similar argument, condition (ii) is also not satisfied when X is replaced by X* = RW, with R = Σ_x Σ_w^{-1}. We remark that this result does not contradict the invariance law of Li and Yin (2007), since that law was established under the assumption that the marginal distribution of X is Gaussian; here, the Gaussian inverse regression model assumes only that the conditional distribution of X | y is Gaussian.
To overcome the challenges of inverse regression-based SDR with measurement error, we propose a new adjusted predictor, constructed from the observed surrogate W, which satisfies properties similar to those in (1), so that inverse regression-based SDR can be applied.
To this end, let V = LW with L = ∆(∆ + Σ_u)^{-1}. The result below ensures that if Ψ is a semi-orthogonal basis matrix for S_{y|X}, then it is also a semi-orthogonal basis matrix for S_{y|V}.
Proposition 1. The conditional distribution of Ψ^T V | y depends on y, whereas the conditional distribution of Ψ_0^T V | (Ψ^T V, y) does not depend on y.
The proof of Proposition 1, along with all other theoretical results, can be found in the Appendix. In summary, V is purposefully constructed such that the conditional mean of Ψ^T V | y depends on y, but the conditional mean of Ψ_0^T V | (Ψ^T V, y) does not.
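The construction of V can be sketched as follows; in practice ∆ must itself be estimated (see the next section), so the code below, which assumes ∆ is known, is purely illustrative:

```python
import numpy as np

def adjusted_surrogate_V(W, Delta, Sigma_u):
    """V = L W with L = Delta (Delta + Sigma_u)^{-1}, where
    Delta = E{Var(X | y)} is the average conditional covariance."""
    L = Delta @ np.linalg.inv(Delta + Sigma_u)
    return W @ L.T

# sanity check: with Sigma_u = 0, L = I and V coincides with W
rng = np.random.default_rng(2)
W = rng.normal(size=(100, 3))
V = adjusted_surrogate_V(W, np.eye(3), np.zeros((3, 3)))
print(np.allclose(V, W))
```

Note the contrast with the invariance-law adjustment, which uses the marginal covariance Σ_x in place of the conditional covariance ∆.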
Maximum likelihood estimation

Corrected LAD estimator
We can now consider the problem of estimating the central subspace S_{y|X} from the data (y_i, W_i^T), i = 1, . . ., n. Similar to LAD and other SDR methods based on inverse moments, we first partition the data into M non-overlapping slices based on the outcome data y_i, i = 1, . . ., n. If the outcome is categorical, each slice corresponds to one category; if the outcome is continuous, the slices are constructed by dividing its range into M non-overlapping intervals. Let S_m ⊂ {1, . . ., n} denote the index set of observations and n_m the number of observations in the mth slice, m = 1, . . ., M. Next, assume the true covariate data within each slice are mutually independent and identically distributed Gaussian, with measurement errors U_i^{(m)} ∼ N(0, Σ_u), where the superscript m indexes the slice to which the ith observation belongs. Furthermore, let μ_m = E(X_i^{(m)}) denote the within-slice mean. We then estimate the central subspace S_{y|X} by maximizing the joint log-likelihood of the observed data across the slices m = 1, . . ., M. Theorem 1 below gives the explicit form of this log-likelihood function when the dimension d is known.
Theorem 1. Let Ψ ∈ R^{p×d} be a semi-orthogonal basis matrix for S_{y|X}. Then the profile log-likelihood function for Ψ and ∆ from the observed data is given by the expression in (3).
The profile likelihood in Theorem 1 depends on both Ψ and ∆, so maximizing it jointly is challenging. We propose to estimate ∆ first using a method-of-moments approach as follows. Ignoring the measurement errors, compute the naïve LAD estimator Ŝ_n on the observed data (y_i, W_i^T), i = 1, . . ., n, and let Ψ̂_n denote a corresponding semi-orthogonal basis of Ŝ_n. An estimate ∆̂_n of E{Var(W | y)} then follows, and we can estimate ∆ by ∆̂ = ∆̂_n − Σ_u. Replacing ∆ by ∆̂ in (3) and removing terms that do not depend on Ψ, we obtain the objective function ℓ_2 in (4), where L̂ = ∆̂(∆̂ + Σ_u)^{-1}. The right-hand side of (4) is invariant to rotation of Ψ by an orthogonal matrix A ∈ R^{d×d}; that is, ℓ_2(ΨA) = ℓ_2(Ψ). Therefore, the function in (4) is actually a function of the column space of Ψ, which is characterized by the projection matrix P_S = ΨΨ^T.
Similar to LAD, maximisation of (4) is then performed over the Grassmann manifold G_{d,p}, the set of all d-dimensional subspaces S of R^p. Specifically, the function in (4) can be written as ℓ_2(S) in (5), where for any square matrix A, we use ∥A∥_0 to denote the product of its non-zero eigenvalues. The corresponding Euclidean gradient of this objective function has a closed form, which facilitates the use of the trust-region algorithm on the Grassmann manifold developed by Absil et al. (2007). Starting from an initial solution Ŝ^(0), the algorithm finds the next candidate Ŝ^(1) by first minimizing a constrained objective function on the tangent space at Ŝ^(0), and then mapping this solution back to G_{d,p}. The constraint imposed on the tangent space is known as a trust region. The decision of whether to accept the candidate and whether to expand the trust region is based on a quotient, whose value leads to one of three possibilities: (i) accept the candidate and expand the trust region, (ii) accept the candidate and reduce the trust region, or (iii) reject the candidate and reduce the trust region. This procedure is carried out until convergence; see Gallivan et al. (2003) and Absil et al. (2007) for further details. We use the Manopt package (Boumal et al., 2014) in Matlab, a dedicated toolbox for optimization on manifolds and matrices, to maximize (5). In this implementation, there is no need to provide the Hessian of ℓ_2(S), since Manopt automatically incorporates a finite-difference approximation of this matrix in its trust-region solver.
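The paper relies on Manopt's trust-region solver; as a lightweight stand-in, the sketch below implements plain Riemannian gradient ascent with a QR retraction on the Grassmann manifold, applied to a simple trace objective whose maximizer (the dominant invariant subspace of a symmetric matrix) is known in closed form. The objective and all names are illustrative assumptions, not the cLAD likelihood itself:

```python
import numpy as np

def grassmann_ascent(grad_f, psi0, step=0.02, iters=2000):
    """Naive Riemannian gradient ascent on the Grassmann manifold G_{d,p}.
    Riemannian gradient: (I - Psi Psi^T) grad_f(Psi); QR retraction."""
    psi = psi0
    p = psi.shape[0]
    for _ in range(iters):
        rgrad = (np.eye(p) - psi @ psi.T) @ grad_f(psi)
        psi, _ = np.linalg.qr(psi + step * rgrad)  # retract to the manifold
    return psi

# Stand-in objective f(Psi) = tr(Psi^T M Psi), maximized over G_{d,p}
# by the span of the top-d eigenvectors of the symmetric matrix M
rng = np.random.default_rng(3)
Q, _ = np.linalg.qr(rng.normal(size=(6, 6)))
M = Q @ np.diag([10.0, 8.0, 1.0, 0.5, 0.2, 0.1]) @ Q.T
d = 2
psi0, _ = np.linalg.qr(rng.normal(size=(6, d)))
psi = grassmann_ascent(lambda P: 2 * M @ P, psi0)

# compare subspaces through their (basis-invariant) projection matrices
top = np.linalg.eigh(M)[1][:, -d:]
gap = np.linalg.norm(psi @ psi.T - top @ top.T)
print(np.round(np.trace(psi.T @ M @ psi), 3))
```

A full trust-region method would additionally model the objective's curvature on each tangent space, but the same Riemannian gradient and retraction ingredients appear there too.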
For the remainder of this article, we refer to the solution of this problem, and hence our proposed estimator, as the corrected LAD estimator (cLAD) of the central subspace.

Invariance-law LAD estimator
The cLAD estimator is constructed by maximizing a likelihood function involving the adjusted surrogate V = LW. In this section, we consider a likelihood-based estimator for the central subspace based instead on maximizing a likelihood function involving X* = Σ_x Σ_w^{-1} W, the adjusted covariate introduced in the invariance law of Li and Yin (2007). From the observed data, a sample of the adjusted covariates can be constructed as X̂*_i = Σ̂_x Σ̂_w^{-1} W_i, where Σ̂_w is the sample covariance matrix of W and Σ̂_x = Σ̂_w − Σ_u. The equivalence between S_{y|X} and S_{y|X*} motivates applying LAD to (y_i, X̂*_i) to obtain the invariance-law LAD (IL-LAD) estimator Ψ̂*, which maximizes an objective function analogous to (4), in which Σ̂* and ∆̂*_m denote the sample marginal covariance of X̂* and the sample covariance of X̂* within the mth slice, respectively. Similar to the cLAD estimator, this maximization is also performed over the Grassmann manifold G_{d,p}. The main difference between the two estimators is that the cLAD estimator uses the conditional covariance ∆ = E{Var(X | y)} to construct the adjusted covariate, whereas the IL-LAD estimator uses the marginal covariance Σ_x.
Nevertheless, in the following subsection, we prove that the IL-LAD and cLAD estimators are asymptotically equivalent. In finite samples, we found that the difference between the two estimators is often negligible.

Consistency
In this section, we establish the consistency of both the cLAD and IL-LAD estimators, and show that they are asymptotically equivalent. We focus on the setting where d is known and p is fixed, similar to Cook and Forzani (2009). Assuming the true covariates X are observed, Cook and Forzani (2009) proved the consistency of the LAD estimator by establishing the equivalence between the population subspace spanned by LAD and that spanned by the sliced average variance estimator, i.e., the true SAVE estimator.
We use a similar argument here, and prove that as n → ∞, the subspace spanned by either the cLAD or the IL-LAD estimator is equivalent to the subspace spanned by the SAVE estimator computed with the true covariates X.
For any subspace S, the function n^{-1} ℓ_2(S) from (5) converges to K_2(S) = log ∥P_S L(Σ_x + Σ_u)L^T P_S∥_0 − Σ_{m=1}^M f_m log ∥P_S L(∆_m + Σ_u)L^T P_S∥_0, where f_m = n_m/n. The population cLAD subspace is then defined as S*_cLAD = arg max_S K_2(S). Similarly, the subspace spanned by the IL-LAD estimator converges to the population IL-LAD subspace S*_IL-LAD. The main result below establishes the equivalence between the subspaces spanned by the two proposed estimators and the subspace spanned by the true SAVE estimator. As noted in Section 3.2, the main difference between the cLAD and IL-LAD estimators is the use of ∆ versus Σ_x in the construction of the adjusted surrogate.
Nevertheless, asymptotically, they are both equivalent to the true SAVE estimator, since the orthogonal complement Φ_0 corresponding to the SAVE estimator satisfies the conditions defining both population subspaces. Similar to the corresponding result in Cook and Forzani (2009), Theorem 2 does not require any distributional assumptions on the model, and relies only on the properties of positive definite matrices and the concavity of the log-determinant function. Nevertheless, the result implies that the cLAD estimator is consistent whenever the true LAD and SAVE estimators (i.e., those computed as if X were observed) are consistent for the central subspace S_{y|X}. As shown in Li and Wang (2007) and Cook and Forzani (2009), these two estimators are consistent under a linearity and a constant covariance condition.

Sparse surrogate dimension reduction
When the number of covariates is large, it is typically assumed that the sufficient predictors B^T X depend on only a few covariates (e.g., Qian et al., 2019); in other words, the true matrix B is row-sparse. Since only the column space of B is identifiable from the sample, we translate the row-sparsity of B into elementwise sparsity of the projection matrix, and impose a corresponding penalty to achieve sparse solutions.
Below, we formalize this idea for the proposed cLAD estimator, although an analogous procedure can be applied to the IL-LAD estimator.
We propose to maximize the regularized objective function ℓ̃_2(S) = ℓ_2(S) − λ∥P_S∥_1, labelled (6), where for any matrix A with elements a_ij we denote ∥A∥_1 = Σ_{i,j} |a_ij|; below, I_{p²} denotes the identity matrix of dimension p² × p² and ⊗ denotes the Kronecker product. The Euclidean gradient of ℓ̃_2 at any semi-orthogonal matrix Ψ is ∇ℓ̃_2(Ψ) = ∇ℓ_2(Ψ) − λ ∂∥ΨΨ^T∥_1/∂Ψ, and the corresponding Riemannian gradient is grad ℓ̃_2(Ψ) = (I_p − ΨΨ^T)∇ℓ̃_2(Ψ). These Euclidean and Riemannian gradients are used in the same trust-region algorithm of Absil et al. (2007) to maximize ℓ̃_2(S). We compute the maximizer of (6) on a grid of the tuning parameter λ consisting of 40 logarithmically equally-spaced values between 0 and λ_max, where λ_max is the value at which the maximizer P_Ŝ of (6) is approximately equal to the identity matrix.
In our experiments, we found that setting λ_max = 1 leads to good performance, and that the algorithm converges quickly to a stable solution at each value of the tuning parameter λ in the grid.
Finally, to select the optimal tuning parameter λ, we use a variant of the projection information criterion (PIC) developed by Nghiem et al. (2022). Let Ŝ_0 be an initial consistent, non-sparse estimator of the central subspace, and let Ŝ(λ) be the maximizer of (6) associated with λ. We propose to select λ by minimizing PIC(λ), whose first term is the goodness-of-fit measure ∥P_Ŝ(λ) − P_Ŝ0∥_F², and whose second term is a complexity measure based on df(P_Ŝ(λ)) = s_λ(s_λ − d), with s_λ being the number of non-zero diagonal elements of P_Ŝ(λ). Compared to the information criterion developed in Nghiem et al. (2022), this modified criterion has a different complexity term so as to more accurately quantify the number of free parameters of the corresponding Grassmann manifold. In our numerical studies, we choose the initial consistent estimator Ŝ_0 to be the estimated central subspace from the unregularized cLAD estimator, and find that this choice usually leads to good variable selection performance. For the remainder of this article, we refer to the maximizer of (6), with the tuning parameter selected via this projection criterion, as the sparse corrected LAD (scLAD) estimator.
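A sketch of PIC-based tuning; since the exact complexity weight is not reproduced above, the log(n)/n factor below is an assumption of ours, not the paper's constant, and all names are illustrative:

```python
import numpy as np

def pic_select(P_hat_by_lambda, P0, d, n, penalty=None):
    """Select lambda by a PIC-type criterion:
    fit = ||P(lambda) - P0||_F^2, complexity df = s*(s - d),
    where s = number of non-zero diagonal entries of P(lambda).
    The penalty weight (log(n)/n by default) is an assumed choice."""
    if penalty is None:
        penalty = np.log(n) / n
    best_lam, best_val = None, np.inf
    for lam, P in P_hat_by_lambda.items():
        s = int(np.sum(np.abs(np.diag(P)) > 1e-8))
        df = s * (s - d)
        val = np.sum((P - P0) ** 2) + penalty * df
        if val < best_val:
            best_lam, best_val = lam, val
    return best_lam

# toy illustration: a sparse candidate matching P0 beats a dense one
P0 = np.diag([1.0, 1.0, 0.0, 0.0])
cands = {0.1: np.diag([1.0, 1.0, 0.0, 0.0]),
         0.0: np.full((4, 4), 0.25) + np.diag([0.5, 0.5, 0.0, 0.0])}
print(pic_select(cands, P0, d=2, n=1000))  # selects 0.1
```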

Simulation studies
We conduct a numerical study to examine the finite-sample performance of the proposed cLAD and IL-LAD estimators. We generate the true predictors and outcome from the following four single/multi-index models: (i) y_i = 0.5(X_i^T β_1)³ + 0.25|X_i^T β_1|ε_i; (ii) y_i = 3(X_i^T β_1)/(1 + X_i^T β_1)² + 0.25ε_i; (iii) y_i = 4 sin(X_i^T β_2/4) + 0.5(X_i^T β_1)² + 0.25ε_i; and (iv) y_i = 3(X_i^T β_1) exp(X_i^T β_2 + 0.25ε_i), with ε_i ∼ N(0, 1) and W_i = X_i + U_i for i = 1, . . ., n. The single index models (i) and (ii) are similar to those considered in Lin et al. (2019) and Nghiem et al. (2022), while the multiple index models (iii) and (iv) are similar to those considered in Reich et al. (2011). The true central subspace in models (i) and (ii) is spanned by B = β_1, while in models (iii) and (iv) it is the column space of B = [β_1, β_2]. Moreover, we set β_1 = (1, 1, 1, 0, . . ., 0)^T and β_2 = (0, 0, 1, 1, 1, 0, . . ., 0)^T, so that the true central subspace for the single index models depends only on the first three covariates, while for the multiple index models it depends only on the first five covariates across the two indices.
Next, we generate the true predictors X_i from one of the following three distributions: (1) a p-variate Gaussian distribution N(0, Σ_x); (2) a p-variate t distribution with three degrees of freedom and the same covariance matrix Σ_x; and (3) a p-variate half-Gaussian distribution |N(0, Σ_x)|. For all three choices, the covariance matrix has the autoregressive structure (Σ_x)_{ij} = 0.5^{|i−j|}. Turning to the measurement errors, we generate U_i from a multivariate Gaussian distribution N(0, Σ_u), where Σ_u is a diagonal matrix with elements drawn from a uniform U(0.2, 0.5) distribution. The sample size n is set to either 1000 or 2000, while the number of covariates p is set to either 20 or 40. We assume Σ_u and the structural dimension d are known.
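For the Gaussian-covariate case, the data generation above can be sketched as follows (a minimal illustration; function and variable names are ours):

```python
import numpy as np

def simulate(n=1000, p=20, model="iii", seed=0):
    """Generate (y, W, Sigma_u) for the Gaussian-covariate case,
    with index functions following models (i)-(iv)."""
    rng = np.random.default_rng(seed)
    beta1 = np.zeros(p); beta1[:3] = 1.0
    beta2 = np.zeros(p); beta2[2:5] = 1.0
    idx = np.arange(p)
    Sigma_x = 0.5 ** np.abs(idx[:, None] - idx[None, :])  # AR(1) covariance
    X = rng.multivariate_normal(np.zeros(p), Sigma_x, size=n)
    eps = rng.normal(size=n)
    s1, s2 = X @ beta1, X @ beta2
    if model == "i":
        y = 0.5 * s1**3 + 0.25 * np.abs(s1) * eps
    elif model == "ii":
        y = 3 * s1 / (1 + s1)**2 + 0.25 * eps
    elif model == "iii":
        y = 4 * np.sin(s2 / 4) + 0.5 * s1**2 + 0.25 * eps
    else:  # model "iv"
        y = 3 * s1 * np.exp(s2 + 0.25 * eps)
    # classical additive measurement error with diagonal Sigma_u
    Sigma_u = np.diag(rng.uniform(0.2, 0.5, size=p))
    W = X + rng.multivariate_normal(np.zeros(p), Sigma_u, size=n)
    return y, W, Sigma_u

y, W, Sigma_u = simulate()
print(y.shape, W.shape)
```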
It is important to highlight that the data generation processes in these simulation studies are not the same as those imposed by the Gaussian inverse regression models. In particular, under these configurations the conditional distribution X_i | y_i is generally not Gaussian, owing to the presence of non-linear link functions; the conditional Gaussianity thus only plays the role of a "working model." This type of data generation process is also used in Cook and Forzani (2009).
For each simulated dataset, we compute the two unregularized likelihood-based estimates, cLAD and IL-LAD, and compare them with several invariance-law inverse-moment-based estimates. We assess the performance of all estimators via the Frobenius norm of the difference between the projection matrix associated with the true central subspace and that of the estimated central subspace. For the sparse estimators, we assess variable selection performance as follows. Let P denote the projection matrix associated with the true directions, i.e., P = B(B^T B)^{-1} B^T, and let P̂ be an estimator of P. We define the number of true positives (TP) as the number of non-zero diagonal elements of P correctly identified as non-zero, the number of false negatives (FN) as the number of non-zero diagonal elements incorrectly identified as zero, and the number of false positives (FP) as the number of zero diagonal elements incorrectly identified as non-zero. We also compute the F1 score, 2TP/(2TP + FP + FN), which ranges from zero to one, with one indicating perfect variable selection.
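The TP/FP/FN counts and F1 score can be computed directly from the diagonals of the true and estimated projection matrices; a small sketch:

```python
import numpy as np

def selection_f1(P_hat, P_true, tol=1e-8):
    """F1 score for variable selection, based on which diagonal
    entries of the estimated projection matrix are non-zero."""
    sel = np.abs(np.diag(P_hat)) > tol
    truth = np.abs(np.diag(P_true)) > tol
    tp = np.sum(sel & truth)
    fp = np.sum(sel & ~truth)
    fn = np.sum(~sel & truth)
    return 2 * tp / (2 * tp + fp + fn)

P_true = np.diag([1.0, 1.0, 1.0, 0.0, 0.0])
P_hat = np.diag([0.9, 0.8, 0.0, 0.3, 0.0])  # misses one, adds one
print(selection_f1(P_hat, P_true))  # 2*2/(2*2 + 1 + 1) = 2/3
```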
For brevity, we present the projection error and F1 variable selection results for p = 40 below; the results for p = 20, along with the FP and FN selection results, offer broadly similar conclusions and are deferred to the Supplementary Materials. Table 1 demonstrates that, among the unregularized estimators, the likelihood-based cLAD and IL-LAD estimators have smaller estimation error in the majority of settings compared to the invariance-law inverse-moment-based estimators. This result is consistent with Cook and Forzani (2009), who demonstrated that LAD generally exhibits superior performance relative to inverse-moment-based estimators such as SIR and SAVE when no measurement error is present.
The performance of all considered estimators tends to deteriorate when the true covariates X deviate from Gaussianity, for example when they are skewed or heavy-tailed; there is negligible difference in performance between the cLAD and IL-LAD estimators. Next, the left half of Table 2 shows the estimation performance of the sparse SDR estimators. In terms of projection error, and analogous to the results for the unregularized methods above, the proposed scLAD estimator tends to have lower estimation error than the IL-Lin and IL-Tan estimators. The improvement is most pronounced in the two multiple index models, i.e., Models (iii) and (iv), and when the true covariate vector X follows a Gaussian distribution. Again, similar to the unregularized estimators, the performance of these sparse estimators deteriorates when the true covariates X deviate from Gaussianity, especially when they follow a heavy-tailed distribution such as t_3. Turning to the variable selection results, the scLAD estimator has the best overall selection performance among the three considered estimators. When the true covariate vector X follows a Gaussian distribution, the scLAD estimator selects the true set of important covariates across all considered settings, as reflected in the corresponding F1 scores all being exactly equal to one.
This conclusion continues to hold when the true covariates follow a half-Gaussian distribution under the single index models (i) and (ii). However, in other settings, such as the multiple index models (iii) and (iv), scLAD tends to have lower F1 scores than IL-Tan for n = 1000, although the trend is reversed when the sample size increases to n = 2000. By contrast, the IL-Lin estimator consistently has a very low F1 score, with additional results in the Supplementary Materials demonstrating that this estimator incurs too many false positives. This reflects the advantage of imposing regularisation directly on the diagonal elements of the corresponding projection matrix, compared to doing so on each direction separately. However, we acknowledge that the variable selection results depend on how the tuning parameter is chosen, and we are not aware of any method guaranteed to achieve selection consistency when the central subspace must be estimated from surrogates.

Survey data
We apply the proposed methodology to analyze a dataset from the National Health and Nutrition Examination Survey (NHANES) (Centers for Disease Control and Prevention, 2022). This survey aims to assess the health and nutritional status of people in the United States and to track the evolution of this status over time. During the 2009-2010 survey period, participants were interviewed about their demographic background and nutrition habits, and undertook a series of health examinations. To assess the nutritional habits of participants, dietary data were collected using two 24-hour recall interviews, wherein participants self-reported the amounts of a set of food items consumed during the 24 hours prior to each interview.
Based on these recalls, the daily aggregated consumption of water, food energy, and other nutrition components, such as total fat and total sugar, was computed.
In this application, we focus on the relationship between participants' total cholesterol level (y), their age (Z), and their daily intakes of 42 nutrition components, including sugars, total vitamins, fats, retinol, lycopene, zinc, and selenium, among many others. We restrict our analysis to n = 3343 women, and assume participants' ages are measured accurately while the daily intakes of nutrition components are subject to additive measurement errors. For the ith participant, let W_i1 and W_i2 denote the 43 × 1 vectors of surrogates at the first and second interview, respectively; for each vector, the first element is the age (Z_i) and the remaining elements are the recorded values of the nutrition components at the corresponding interview. We assume the classical measurement error model W_ij = X_i + U_ij for i = 1, . . ., n and j = 1, 2, where X_i denotes the vector of long-term nutrition intakes, so that the covariance matrix Σ_u of the measurement errors can be estimated from the replicate interviews. Note that, since age is assumed to be measured without error, the first row and column of Σ̂_u are zero.

Appendix A: Proof of Proposition 1
Throughout the proof, we use equalities from Cook and Forzani (2009, p. 205), which in turn follow from Rao et al. (1973, p. 77); these are labelled (7) below. Let B ∈ R^{p×p} be a symmetric positive definite matrix, and let (Ψ, Ψ_0) ∈ R^{p×p} be an orthogonal matrix. Recall that V = LW with L = ∆(∆ + Σ_u)^{-1}, and the inverse Gaussian model X | y ∼ N(μ_y, ∆_y) with μ_y − μ = ∆Ψν_y, from which part (i) of Proposition 1 follows. For part (ii), the conditional mean of Ψ_0^T V | (Ψ^T V, y) does not depend on y; indeed, the coefficient associated with ν_y vanishes. To see this, apply (7) with B = G_y, noting that steps (a) and (b) of the intermediate calculations follow from Ψ_0^T ∆_y^{-1} = Ψ_0^T ∆^{-1}, and that the final equality follows from applying (7) with B = L∆. Substituting this into the left-hand side of (12) gives the claim, and hence the conditional expectation in (11) does not depend on y. Next, we maximize the log-likelihood with respect to μ at the optimized value of ν_m, by taking the derivative with respect to μ and setting it equal to zero; the maximum over D follows similarly. Finally, substituting D into (17) and removing terms that do not involve Ψ and L, we obtain the joint profile likelihood of Ψ and ∆, where c is a quantity that does not depend on Ψ or Ψ_0. To show the consistency of the estimated central subspace, we show that the true SAVE estimator (i.e., the one defined when no measurement error is present) is the global maximum of V_0(Ψ).
Finally, since L is estimated by L̂ from the naïve LAD estimator, we prove that L̂ is consistent for L. It suffices to show that ∆̂_n is a consistent estimator of E{Var(W | y)}. By the relationship between LAD and SAVE, Ψ̂_n converges to a naïve SAVE population basis Φ_n, and (20) then follows from Σ_{m=1}^M f_m = 1.
Concluding remarks
Zhang et al. (2012) examined the surrogate SDR problem when both the response and the covariates are distorted by multiplicative measurement errors, while Chen and Yi (2022) proposed an SDR method for survival data in which the response is censored and the covariates are simultaneously measured with error. Neither of these works addresses the sparsity of the central subspace when p is large, and more broadly there is little direct research on inverse regression methods for SDR under measurement error. In this paper, we propose two likelihood-based estimators of the central subspace S_{y|X} from the surrogate data (y_i, W_i^T). For the first estimator, we directly follow the approach of LAD and model the inverse predictor X | y as Gaussian, with the existence of the central subspace imposing multiple constraints on the parameters; the additive measurement errors are then incorporated into this model in a straightforward manner. We construct a likelihood-based objective function for the semi-orthogonal bases of the central subspace, and show that maximizing this function requires solving an optimization problem on the Grassmann manifold. For the second estimator, we apply the LAD approach to the adjusted surrogate X*. Although the two estimators make different adjustments to the observed surrogates, we show that they are asymptotically equivalent in terms of estimating the true central subspace. Furthermore, we propose a sparse estimator for the central subspace by augmenting the likelihood function with a penalty term that regularizes the elements of the corresponding projection matrix; the resulting objective function has a closed-form Riemannian gradient, which facilitates efficient computation of the penalized estimator. Simulation studies and an application to data from the National Health and Nutrition Examination Survey in the United States demonstrate that the proposed likelihood-based estimators are superior to several common inverse moment-based estimators in terms of both estimating the central subspace and variable selection accuracy.
Here, Σ_w and ∆_wm denote the sample (marginal) covariance matrix of W and the sample covariance matrix of W within the mth slice, respectively, and |A| denotes the determinant of a square matrix A.
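As a rough illustration of these slice-wise quantities (a sketch, not the authors' code; equal-frequency slicing of y into M slices is an assumption):

```python
import numpy as np

def slice_covariances(W, y, M):
    # Sigma_w: sample (marginal) covariance of the surrogates W.
    # Delta_wm: sample covariance of W within each of M
    # equal-frequency slices of the response y (assumed slicing rule).
    Sigma_w = np.cov(W, rowvar=False, bias=True)
    edges = np.quantile(y, np.linspace(0.0, 1.0, M + 1))
    slice_id = np.clip(np.searchsorted(edges, y, side="right") - 1, 0, M - 1)
    Delta_wm = [np.cov(W[slice_id == m], rowvar=False, bias=True)
                for m in range(M)]
    return Sigma_w, Delta_wm

rng = np.random.default_rng(0)
W, y = rng.standard_normal((200, 4)), rng.standard_normal(200)
Sigma_w, Delta_wm = slice_covariances(W, y, M=5)
```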
Theorem 2. S*_cLAD = S*_IL−LAD = S*_SAVE, where S*_SAVE denotes the population subspace spanned by the SAVE estimator when the true covariates X are observed.
and λ > 0 is a tuning parameter. Although it is possible to use other regularization functions to achieve sparse solutions, here we follow Wang et al. (2017) and choose the ℓ1 norm regularizer to ease the optimization problem on the Grassmann manifold, as both the Euclidean and Riemann gradients of the objective function ℓ2(S) have closed forms. In more detail, for any matrix A of arbitrary dimension m × n, let vec(A) denote the mn-dimensional vector formed by stacking the columns of A. Similarly, let ivec(·) denote the inverse vectorization operator, i.e., ivec{vec(A)} = A. Let T_{m,n} be the (unique) matrix of dimension mn × mn satisfying vec(A) = T_{m,n} vec(A^T). Taking the Euclidean gradient of the regularization term with respect to Ψ then yields an expression involving {I_{p^2} + T_{p,p}}(Ψ ⊗ I_p), the Jacobian of vec(ΨΨ^T) with respect to vec(Ψ).

We compare the proposed estimators with invariance-law versions of three inverse moment-based estimates: SIR (IL-SIR), SAVE (IL-SAVE) and directional regression (IL-DR). To account for the sparsity of the central subspace, we compute the scLAD estimator and compare it with two invariance-law sparse estimates, namely the Lasso SIR of Lin et al. (2019) (IL-Lin) and the sparse SIR proposed by Tan et al. (2018) (IL-Tan). Both are sparse versions of sliced inverse regression: the first imposes sparsity on each sufficient direction, while the second imposes sparsity on the projection matrix and thereby induces row-sparsity. For the estimator proposed by Lin et al. (2019), we use the R package LassoSIR with its default settings. For the estimator proposed by Tan et al. (2018), we use the code provided by the authors. For both estimators, we follow the recommended procedure from the authors and choose the tuning parameter by ten-fold cross-validation.
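The commutation matrix T_{m,n} is a permutation matrix and can be constructed explicitly; a minimal numerical check of the defining property vec(A) = T_{m,n} vec(A^T):

```python
import numpy as np

def commutation_matrix(m, n):
    # T such that vec(A) = T @ vec(A.T) for any m x n matrix A,
    # where vec stacks columns; T is an mn x mn permutation matrix.
    T = np.zeros((m * n, m * n))
    for i in range(m):
        for j in range(n):
            # A[i, j] sits at index j*m + i in vec(A)
            # and at index i*n + j in vec(A.T)
            T[j * m + i, i * n + j] = 1.0
    return T

A = np.arange(12.0).reshape(3, 4)          # 3 x 4 example matrix
T = commutation_matrix(3, 4)
vec = lambda M: M.reshape(-1, order="F")   # column-stacking vectorization
assert np.allclose(vec(A), T @ vec(A.T))
```

Since T_{m,n} is a permutation matrix, it is orthogonal, which is what makes the closed-form gradient cheap to apply.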

Figure 1: Scatterplots of the outcome versus sufficient predictors obtained from the three sparse estimators of the central subspace in the application to the NHANES data. The blue curves are LOESS smoothing curves and the grey areas are confidence bands.
which does not depend on y either, where the second equality follows from (8) with B = G_y. The proof is hence complete.

Let Z_m := Ψ^T L(W_m − µ), B := Ψ^T L∆Ψ, and B_m := Ψ^T G_m Ψ. Differentiating (14) with respect to v_m and setting the derivative equal to zero, we obtain 2f_m B B_m^{-1}(Z_m − B v_m) + λ f_m = 0, or equivalently, 2f_m Z_m − 2B f_m v_m + f_m B_m B^{-1} λ = 0. Summing over all m and using the constraints, we have λ = Σ_{m=1}^M f_m Z_m =: Z̄, and hence Z_m − B v_m = B_m B^{-1} λ. Therefore, at the optimized value of v_m, (14) reduces to

ℓ2(Ψ, ∆) = n log |Ψ_0^T (L Σ_w L^T)^{-1} Ψ_0| − Σ_{m=1}^M n_m log |Ψ^T L ∆_wm L^T Ψ|
= −n log |L Σ_w L^T| + n log |Ψ^T (L Σ_w L^T) Ψ| − Σ_{m=1}^M n_m log |Ψ^T L ∆_wm L^T Ψ|.
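The last equality rests on the determinant identity log|Ψ_0^T M^{-1} Ψ_0| = log|Ψ^T M Ψ| − log|M| for complementary orthonormal bases (Ψ, Ψ_0). A quick numerical check, with a random symmetric positive definite matrix standing in for L Σ_w L^T:

```python
import numpy as np

rng = np.random.default_rng(0)
p, d = 5, 2

# random SPD matrix standing in for L @ Sigma_w @ L.T
A = rng.standard_normal((p, p))
S = A @ A.T + p * np.eye(p)

# orthonormal basis Psi (p x d) and its orthogonal complement Psi0
Q, _ = np.linalg.qr(rng.standard_normal((p, p)))
Psi, Psi0 = Q[:, :d], Q[:, d:]

logdet = lambda M: np.linalg.slogdet(M)[1]

# identity used in the profiled objective:
# log|Psi0.T S^{-1} Psi0| = log|Psi.T S Psi| - log|S|
lhs = logdet(Psi0.T @ np.linalg.inv(S) @ Psi0)
rhs = logdet(Psi.T @ S @ Psi) - logdet(S)
assert np.isclose(lhs, rhs)
```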

Table 1: Mean projection errors of unregularized estimators in simulation settings with p = 40. The lowest value in each row is highlighted.
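The projection error tabulated here is presumably the Frobenius distance between the estimated and true projection matrices, a standard subspace-estimation metric; a minimal sketch under that assumption:

```python
import numpy as np

def projection_error(B_hat, B):
    # Frobenius distance between the projection matrices onto
    # span(B_hat) and span(B); invariant to the choice of basis
    proj = lambda M: M @ np.linalg.inv(M.T @ M) @ M.T
    return np.linalg.norm(proj(B_hat) - proj(B), "fro")

B = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])  # 2-dim subspace of R^3
C = B @ np.array([[2.0, 1.0], [0.0, 3.0]])          # same subspace, new basis
assert np.isclose(projection_error(C, B), 0.0)
```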

Table 2: Performance of the sparse estimators based on average projection error and F1 score for simulation settings with p = 40. The lowest projection error and the highest F1 score in each row are highlighted.
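The F1 score for variable selection can be computed from the set of covariates declared active; the rule below (a covariate is selected when its row of the estimated projection matrix is nonzero) is an assumption for illustration, not a rule stated in the paper:

```python
import numpy as np

def selection_f1(P_hat, active_true, tol=1e-8):
    # F1 score for variable selection: covariate j counts as selected
    # when row j of the estimated projection matrix has norm > tol
    # (assumed selection rule)
    selected = np.linalg.norm(P_hat, axis=1) > tol
    tp = np.sum(selected & active_true)
    if tp == 0:
        return 0.0
    prec, rec = tp / np.sum(selected), tp / np.sum(active_true)
    return 2 * prec * rec / (prec + rec)

P_hat = np.diag([1.0, 1.0, 0.0, 0.0])            # rows 0 and 1 nonzero
truth = np.array([True, True, False, False])
f1 = selection_f1(P_hat, truth)
```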