1 Introduction

In a regression setting with an outcome \(y \in \mathbb {R}\) and a covariate vector \(\textbf{X}\in \mathbb {R}^p\), sufficient dimension reduction (SDR) refers to a class of methods that relate the outcome to the covariates through only a few linear combinations of \(\textbf{X}\) (Li 2018). In other words, SDR aims to estimate a matrix \(\textbf{B}\in \mathbb {R}^{p \times d}\) such that \(y \perp \textbf{X}\mid \textbf{B}^{\text {T}} \textbf{X}\), where \(d \ll p \). As the matrix \(\textbf{B}\) is generally not unique, the estimation target in SDR is the central subspace \(\mathcal {S}_{y \vert \textbf{X}}\), defined as the intersection of all subspaces spanned by the columns of matrices \(\textbf{B}\) satisfying the above conditional independence condition. This central subspace is unique under mild conditions (Glaws et al. 2020), and is characterized by its associated projection matrix.

This article focuses on the problem of estimating the central subspace \(\mathcal {S}_{y \vert \textbf{X}}\) when the covariates \(\textbf{X}\) are measured with error. This problem is known in the literature as surrogate sufficient dimension reduction (Li and Yin 2007). Instead of observing a sample for \(\textbf{X}\), we observe a sample of \(\textbf{W}\), which are surrogates related to \(\textbf{X}\) by the classical additive measurement error model \(\textbf{W}= \textbf{X}+ \textbf{U}\), where \(\textbf{U}\) is a vector of measurement errors independent of \(\textbf{X}\) that follows a p-variate Gaussian distribution with mean zero and covariance matrix \(\varvec{\Sigma }_u\). While more general measurement error models have also been proposed in the literature, e.g., \(\textbf{W}= \textbf{A}\textbf{X}+ \textbf{U}\) for \(\textbf{A} \in \mathbb {R}^{r \times p}\) with \(r \ge p\) (Carroll and Li 1992; Li and Yin 2007; Zhang et al. 2014), we highlight that the classical measurement error model with \(\textbf{A} = \textbf{I}_p\) is the most widely used in practice (see Grace et al. 2021; Chen and Yi 2022; Chen 2023; Nghiem et al. 2023, for some recent examples), and so we focus on this case throughout the paper. Also, throughout the article we assume \(\varvec{\Sigma }_u\) is known; in practice, this covariance matrix may be estimated from auxiliary data, such as replicate observations, before subsequent analyses are conducted (Carroll et al. 2006). Our observed data thus consist of n pairs \((y_i, \textbf{W}_i^{\text {T}})\) with \(\textbf{W}_i = \textbf{X}_i + \textbf{U}_i\), \(i=1,\ldots , n\).

When the true covariates \(\textbf{X}_i\) are observed, there exists a vast literature on how to estimate the central subspace; see Li (2018) for an overview. These include traditional inverse moment-based methods such as sliced inverse regression (Li 1991) and sliced average variance estimation (Cook 2000), forward regression methods such as minimum average variance estimation (Xia et al. 2002) and directional regression (Li and Wang 2007), and inverse regression methods such as principal fitted components (Cook and Forzani 2008) and likelihood-acquired directions (LAD, Cook and Forzani 2009). Recent literature has also expanded these methods into more complex settings, such as high-dimensional data (e.g., Lin et al. 2021; Qian et al. 2019) and longitudinal data (e.g., Hui and Nghiem 2022). Among the estimators discussed above, the LAD method developed by Cook and Forzani (2009) has a unique advantage in that it is constructed from a well-defined Gaussian likelihood function and, as such, inherits the optimality properties of likelihood theory. Also, while the conditional Gaussianity assumption appears unnatural, Cook and Forzani (2009) show that the LAD estimator has superior performance even when Gaussianity does not hold; roughly speaking, the conditional normality plays the role of a “working model.” In this article, we focus on adapting LAD to the surrogate SDR problem.

The problem of surrogate SDR was first considered by Carroll and Li (1992), who showed that \(\mathcal {S}_{y \vert \textbf{X}}\) can be estimated by performing ordinary least squares, or sliced inverse regression, of the response on the adjusted surrogate \(\textbf{X}^* = \textbf{R}\textbf{W}\), where \(\textbf{R}= \varvec{\Sigma }_{xw}\varvec{\Sigma }_{w}^{-1} = (\varvec{\Sigma }_{w}-\varvec{\Sigma }_u)\varvec{\Sigma }_{w}^{-1}\). Here, \(\varvec{\Sigma }_{xw} = \text {Cov}(\textbf{X},\textbf{W})\) denotes the covariance matrix of \(\textbf{X}\) and \(\textbf{W}\), and \(\varvec{\Sigma }_{w} = \text {Var}(\textbf{W})\) denotes the variance-covariance matrix of \(\textbf{W}\). When \(\varvec{\Sigma }_u\) is known, these adjusted surrogates can be computed by replacing \(\varvec{\Sigma }_w\) with an appropriate estimator from the sample. This underlying idea was expanded into a broader invariance law by Li and Yin (2007), who proved that if both \(\textbf{X}\) and \(\textbf{U}\) follow multivariate Gaussian distributions, then \(\mathcal {S}_{y \vert {\textbf{X}^*}} = \mathcal {S}_{y \vert \textbf{X}} \). As a result, under the assumption of Gaussianity of both \(\textbf{X}\) and \(\textbf{U}\), any consistent SDR method applied to y and the adjusted surrogate \({\textbf{X}}^*\) is also consistent. If \(\textbf{X}\) is non-Gaussian, this relationship is maintained provided \(\textbf{B}^{\text {T}} \textbf{X}\) is approximately Gaussian; for instance, this is achieved when \(p \rightarrow \infty \) and each column of \(\textbf{B}\) is dense, i.e., the majority of elements in each column are non-zero. On the other hand, when the number of covariates is large, it is often assumed that the central subspace is sparse (i.e., it depends only on a few covariates; Lin et al. 2019), and so this Gaussian approximation may not be realistic or relevant in practice. Elsewhere, Zhang et al. (2012) examine the surrogate SDR problem when both the response and the covariates are distorted by multiplicative measurement errors, while Chen and Yi (2022) propose an SDR method for survival data in which the response is censored and the covariates are simultaneously measured with error. Neither of these works addresses the sparsity of the central subspace when p is large, and more broadly there is little direct research on inverse regression methods for SDR under measurement error.
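
To make the Carroll-Li adjustment concrete, the following minimal MATLAB sketch forms the estimated adjustment matrix \(\hat{\textbf{R}} = (\hat{\varvec{\Sigma }}_w - \varvec{\Sigma }_u)\hat{\varvec{\Sigma }}_w^{-1}\) from a simulated surrogate sample and applies it to each observation; the simulated data and all variable names are placeholders for illustration, not part of any released code.

```matlab
% Minimal sketch (simulated data): compute the adjusted surrogates X* = Rhat * W,
% assuming the measurement error covariance Sigma_u is known.
n = 500; p = 10;
Sigma_u = 0.3 * eye(p);                        % known measurement error covariance
X = randn(n, p);                               % unobserved true covariates (placeholder)
W = X + sqrt(0.3) * randn(n, p);               % observed surrogates W = X + U
Sigma_w_hat = cov(W);                          % sample covariance of W
Rhat = (Sigma_w_hat - Sigma_u) / Sigma_w_hat;  % (Sigma_w_hat - Sigma_u) * inv(Sigma_w_hat)
Xstar = W * Rhat';                             % row i holds the adjusted surrogate X*_i
```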

In this paper, we propose two likelihood-based estimators of the central subspace \(\mathcal {S}_{y\vert \textbf{X}}\) from the surrogate data \((y_i, \textbf{W}_i^{\text {T}})\). For the first estimator, we directly follow the approach of LAD and model the inverse predictor \(\textbf{X}\mid y\) as Gaussian, with the existence of the central subspace imposing multiple constraints on the parameters; measurement error is then incorporated into this model in a straightforward manner. We construct a likelihood-based objective function for the semi-orthogonal bases of the central subspace, and show that maximizing this function requires solving an optimization problem on the Grassmann manifold. For the second estimator, we apply the LAD approach to the adjusted surrogate \(\textbf{X}^*\). Although the two estimators adjust the surrogates differently, we show that they are asymptotically equivalent in terms of estimating the true central subspace. Furthermore, we propose a sparse estimator for the central subspace by augmenting the likelihood function with a penalty term that regularizes the elements of the corresponding projection matrix. The resulting objective function has a closed-form Riemann gradient, which facilitates efficient computation of the penalized estimator. Simulation studies and an application to data from the United States National Health and Nutrition Examination Survey demonstrate that the proposed likelihood-based estimators are superior to several common inverse moment-based estimators in terms of both estimation of the central subspace and variable selection accuracy.

The rest of the paper is organized as follows: In Sect. 2, we briefly review the LAD estimator of the central subspace, which is built upon a Gaussian inverse regression model. Then we incorporate measurement errors and discuss the properties of the corresponding model. We propose maximum likelihood estimators of the central subspace from the surrogate data in Sect. 3. In Sect. 4, we propose a regularized estimator to achieve sparsity in the estimated central subspace. Section 5 presents simulation studies to demonstrate the performance of the proposed estimators, while Sect. 6 illustrates the methodology in a data application. Finally, Sect. 7 contains some concluding remarks.

2 A Gaussian inverse regression model with measurement errors

We first review the LAD method proposed by Cook and Forzani (2009) for when the true covariate vector \(\textbf{X}\) is observed. LAD models the inverse predictor \(\textbf{X}\mid y\) as having a multivariate Gaussian distribution with mean \(\varvec{\mu }_y\) and covariance \(\varvec{\Delta }_y\), i.e., \(\textbf{X}\mid y \sim N(\varvec{\mu }_y, \varvec{\Delta }_y)\). Importantly, the existence of the central subspace \(\mathcal {S}_{y \vert \textbf{X}}\) imposes constraints on the parameters \(\varvec{\mu }_y\) and \(\varvec{\Delta }_y\). Let \(\varvec{\mu }= \text {E}(\textbf{X})\), \(\varvec{\Delta }= \text {E}(\varvec{\Delta }_y) = \text {E}\left\{ \text {Var}(\textbf{X}\mid y) \right\} \), and let \(\varvec{\Psi }\in \mathbb {R}^{p \times d}\) be a semi-orthogonal matrix and \(\varvec{\Psi }_0 \in \mathbb {R}^{p \times (p-d)} \) be a matrix such that \((\varvec{\Psi }, \varvec{\Psi }_0) \in \mathbb {R}^{p \times p}\) is an orthogonal matrix. Then the column space of \(\varvec{\Psi }\) is the central subspace \(\mathcal {S}_{y \vert \textbf{X}}\) if and only if the following two conditions are satisfied:

$$\begin{aligned} (i)&~ \varvec{\Psi }^{\text {T}} \textbf{X}\mid y \sim N(\varvec{\Psi }^{\text {T}} \varvec{\mu }+ \varvec{\Psi }^{\text {T}} \varvec{\Delta }\varvec{\Psi }\varvec{v}_y, ~\varvec{\Psi }^{\text {T}} \varvec{\Delta }_y \varvec{\Psi }), \\ (ii)&~ \varvec{\Psi }_0^{\text {T}} \textbf{X}\mid (\varvec{\Psi }^{\text {T}} \textbf{X}, y) \sim N \left\{ \textbf{H}\varvec{\Psi }^{\text {T}} \textbf{X}+ (\varvec{\Psi }_0^{\text {T}} - \textbf{H}\varvec{\Psi }^{\text {T}})\varvec{\mu }, ~\textbf{D}\right\} , \end{aligned}$$
(1)

where \(\varvec{v}_y \in \mathbb {R}^d\) denotes a deterministic function of y, \(\textbf{H}= (\varvec{\Psi }_0^{\text {T}} \varvec{\Delta }\varvec{\Psi })(\varvec{\Psi }^{\text {T}}\varvec{\Delta }\varvec{\Psi })^{-1}\) and \(\textbf{D}= (\varvec{\Psi }_0^{\text {T}} \varvec{\Delta }^{-1} \varvec{\Psi }_0)^{-1}\). While condition (i) allows the conditional distribution of \(\varvec{\Psi }^{\text {T}} \textbf{X}\mid y\) to depend on y, condition (ii) requires that the conditional distribution \(\varvec{\Psi }_0^{\text {T}} \textbf{X}\mid (\varvec{\Psi }^{\text {T}} \textbf{X}, y)\) does not depend on y. This aligns with the intuition that \(\varvec{\Psi }^{\text {T}} \textbf{X}\) is a sufficient predictor for y. Cook and Forzani (2009) also provide several equivalent characterizations of the two conditions in (1), for example, \(\varvec{\Psi }_0^{\text {T}} \varvec{\Delta }_y^{-1} = \varvec{\Psi }_0^{\text {T}} \varvec{\Delta }^{-1}\) and \(\varvec{\mu }_y - \varvec{\mu }= \varvec{\Delta }\varvec{\Psi }\varvec{v}_y\). The advantage of the conditions in Eq. (1) is that a log-likelihood function can be constructed and maximized to consistently estimate all the parameters of the LAD model. As reviewed in Sect. 1, Cook and Forzani (2009) highlighted that Gaussianity of the inverse predictor \(\textbf{X}\mid y\) is not essential for LAD. Indeed, their simulation results show that the LAD estimator has superior performance to other SDR methods in terms of recovering the central subspace, even when the Gaussianity assumption is not satisfied.

When measurement errors following the classical additive model are present in the covariates, so that \(\textbf{X}\) is replaced by \(\textbf{W}\), condition (ii) in (1) is no longer satisfied. Specifically, we can straightforwardly show that the conditional mean of \(\varvec{\Psi }_0^{\text {T}} \textbf{W}\) given \(\varvec{\Psi }^{\text {T}} \textbf{W}\) and y is \( \text {E} ( \varvec{\Psi }_0^{\text {T}} \textbf{W}\vert \varvec{\Psi }^{\text {T}} \textbf{W}, y) = \varvec{\Psi }_0^{\text {T}} \varvec{\mu }+ \varvec{\Psi }_0^{\text {T}} \varvec{\Delta }\varvec{\Psi }\varvec{v}_y - (\textbf{H}\varvec{\Psi }^{\text {T}} \varvec{\Delta }_y \varvec{\Psi }+ \varvec{\Psi }_0^{\text {T}} \varvec{\Sigma }_u \varvec{\Psi })\left\{ \varvec{\Psi }^{\text {T}} (\varvec{\Delta }_y + \varvec{\Sigma }_u) \varvec{\Psi }\right\} ^{-1}\) \((\varvec{\Psi }^{\text {T}}\textbf{W}- \varvec{\Psi }^{\text {T}} \varvec{\Delta }\varvec{\Psi }\varvec{v}_y)\). This quantity generally depends on y, and so \(\varvec{\Psi }^{\text {T}} \textbf{W}\) is no longer guaranteed to be a sufficient reduction, that is, the column space of \(\varvec{\Psi }\) need not coincide with the central subspace \(\mathcal {S}_{y \vert \textbf{W}}\). By a similar argument, condition (ii) is also not satisfied when \(\textbf{X}\) is replaced by \(\hat{\textbf{X}} = \textbf{R}\textbf{W}\), with \(\textbf{R}= \varvec{\Sigma }_x \varvec{\Sigma }_w^{-1}\). We remark that this result does not contradict the invariance law of Li and Yin (2007), since that law was established under the assumption that the marginal distribution of \(\textbf{X}\) is Gaussian, whereas the Gaussian inverse regression model assumes only that the conditional distribution of \(\textbf{X}\mid y\) is Gaussian.

To overcome the challenges of inverse regression-based SDR with measurement error, we propose a new predictor of \(\textbf{X}\) constructed from the observed surrogate \(\textbf{W}\) which satisfies properties similar to those in (1), so that inverse regression based SDR can be applied. To this end, let \(\textbf{V}= \textbf{L}\textbf{W}\) with \(\textbf{L}= \varvec{\Delta }(\varvec{\Delta }+ \varvec{\Sigma }_u)^{-1}\). The result below ensures that if \(\varvec{\Psi }\) is a semi-orthogonal basis matrix for \(\mathcal {S}_{y \vert \textbf{X}}\), then it is also a semi-orthogonal basis matrix for \(\mathcal {S}_{y \vert \textbf{V}}\).

Proposition 1

Let \(\textbf{G}_y = \textbf{L}(\varvec{\Delta }_y + \varvec{\Sigma }_u) \textbf{L}^{\text {T}}\) and \(\tilde{\textbf{D}} = (\varvec{\Psi }_0^{\text {T}} \varvec{\Delta }^{-1} \textbf{L}^{-1} \varvec{\Psi }_0)^{-1}\). The conditional distributions \(\varvec{\Psi }^{\text {T}} \textbf{V}| y\) and \(\varvec{\Psi }_0^{\text {T}} \textbf{V}\vert (\varvec{\Psi }^{\text {T}} \textbf{V}, y)\) are given by

$$\begin{aligned} (i)&~ \varvec{\Psi }^{\text {T}} \textbf{V}\mid y \sim N(\varvec{\Psi }^{\text {T}} \textbf{L}\varvec{\mu }+ \varvec{\Psi }^{\text {T}} \textbf{L}\varvec{\Delta }\varvec{\Psi }\varvec{v}_y, ~\varvec{\Psi }^{\text {T}} \textbf{G}_y \varvec{\Psi }), \\ (ii)&~ \varvec{\Psi }_0^{\text {T}} \textbf{V}\mid (\varvec{\Psi }^{\text {T}} \textbf{V}, y) \sim N\{\textbf{H}\varvec{\Psi }^{\text {T}} \textbf{V}+ (\varvec{\Psi }_0^{\text {T}} - \textbf{H}\varvec{\Psi }^{\text {T}}) \textbf{L}\varvec{\mu }, ~\tilde{\textbf{D}} \}. \end{aligned}$$
(2)

The proof of Proposition 1 and all the other theoretical results can be found in the Appendix. In summary, \(\textbf{V}\) is purposefully constructed such that the conditional mean of \(\varvec{\Psi }^{\text {T}} \textbf{V}\vert y\) depends on y, but the conditional mean of \(\varvec{\Psi }_0^{\text {T}} \textbf{V}\vert (\varvec{\Psi }^{\text {T}} \textbf{V}, y)\) does not.

3 Maximum likelihood estimation

3.1 Corrected LAD estimator

We can now consider the problem of estimating the central subspace \(\mathcal {S}_{y \vert \textbf{X}}\) from the data \((y_i, \textbf{W}_i^{\text {T}})\), \(i=1,\ldots , n\). Similar to LAD and other SDR methods based on inverse-moments, we first partition the data into M non-overlapping slices based on the outcome data \(y_i\), for \(i=1,\ldots , n\). If the outcome is categorical, each slice corresponds to one category. If the outcome is continuous, these slices are constructed by dividing its range into M non-overlapping intervals. Let \(S_m \subset \{1, \ldots , n\}\) denote the index set of observations and \(n_m\) be the number of observations in the mth slice, \(m=1,\ldots ,M\). Next, assume the true covariate data within each slice are mutually independent and identically distributed, such that \( \textbf{X}_i^{(m)} \mid y_{i}^{(m)} \sim N(\varvec{\mu }_{y}^{(m)}, \varvec{\Delta }_{y}^{(m)})\) where \( \textbf{W}_i^{(m)} = \textbf{X}_i^{(m)} + \textbf{U}_i^{(m)}, ~ \textbf{U}_i^{(m)} \sim N(\varvec{0}, \varvec{\Sigma }_u), \) and we use the superscript m to index the slice to which the ith observation belongs. Furthermore, we set \(\text {E}\left( \varvec{\mu }_y^{(m)}\right) = \text {E}\left( \textbf{X}_i^{(m)}\right) = \varvec{\mu }\), and \(\text {E}\left( \varvec{\Delta }_y^{(m)}\right) = \varvec{\Delta }\), that is, these expectations do not depend on the slice m.
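
As a concrete illustration of the slicing step for a continuous outcome, the following minimal sketch assigns observations to M slices of roughly equal size based on the ranks of \(y_i\) (a categorical outcome would simply use its categories); slice_assign is a hypothetical helper name, not part of the released code.

```matlab
% Minimal sketch: assign each observation to one of M slices of (roughly) equal
% size based on the ranks of a continuous outcome y.
function [slice_id, f] = slice_assign(y, M)
n = numel(y);
[~, order] = sort(y);                            % order(i) = index of the ith smallest y
slice_id = zeros(n, 1);
slice_id(order) = ceil((1:n)' * M / n);          % slice labels 1, ..., M along increasing y
f = histcounts(slice_id, 0.5:1:(M + 0.5)) / n;   % slice proportions f_m = n_m / n
end
```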

From Proposition 1, let \(\textbf{V}_{i}^{(m)} = \textbf{L}\textbf{W}_i^{(m)}\). We then estimate the central subspace \(\mathcal {S}_{y \vert \textbf{X}}\) by maximizing the joint log-likelihood of \(\varvec{\Psi }^{\text {T}} \textbf{V}_i^{(m)}\) and \(\varvec{\Psi }_0^{\text {T}} \textbf{V}_i^{(m)}\) for \(i=1,\ldots , n\) and \(m=1,\ldots ,M\). Theorem 1 below gives the explicit form of this log-likelihood function when the dimension d is known.

Theorem 1

Let \(\varvec{\Psi }\in \mathbb {R}^{p \times d}\) be a semi-orthogonal basis matrix for \(\mathcal {S}_{y \vert \textbf{X}}\). Then the profile log-likelihood function for \(\varvec{\Psi }\) and \(\varvec{\Delta }\) from the observed data is given by

$$\begin{aligned} \ell _1(\varvec{\Psi }, \varvec{\Delta })&= -n \log \vert \textbf{L}\tilde{\varvec{\Sigma }}_w \textbf{L}^{\text {T}} \vert + n \log \vert \varvec{\Psi }^{\text {T}} (\textbf{L}\tilde{\varvec{\Sigma }}_w \textbf{L}^{\text {T}}) \varvec{\Psi }\vert \\&\quad - \sum _{m=1}^{M} n_m \log \vert \varvec{\Psi }^{\text {T}} \textbf{L}\tilde{\varvec{\Delta }}_{wm} \textbf{L}^{\text {T}} \varvec{\Psi }\vert , \end{aligned}$$
(3)

where \(\textbf{L}= \varvec{\Delta }(\varvec{\Delta }+ \varvec{\Sigma }_u)^{-1}\), \(\tilde{\varvec{\Sigma }}_{w}\) and \(\tilde{\varvec{\Delta }}_{wm}\) denote the sample (marginal) covariance matrix of \(\textbf{W}\) and the sample covariance matrix of \(\textbf{W}\) within the mth slice, respectively, and \(\vert \textbf{A}\vert \) denotes the determinant of the square matrix \(\textbf{A}\).
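
The sample moments entering (3) can be computed slice by slice, as in the following sketch; surrogate_moments is a hypothetical helper that uses the slice assignment from the earlier sketch.

```matlab
% Minimal sketch: marginal and within-slice sample covariance matrices of the
% surrogates W (n x p), given slice labels slice_id in {1, ..., M}.
function [SigW, DelW] = surrogate_moments(W, slice_id, M)
SigW = cov(W);                            % sample (marginal) covariance of W
DelW = cell(1, M);
for m = 1:M
    DelW{m} = cov(W(slice_id == m, :));   % sample covariance within slice m
end
end
```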

The profile likelihood in Theorem 1 depends on both \(\varvec{\Psi }\) and \(\varvec{\Delta }\), so maximizing it jointly is challenging. We propose to estimate \(\varvec{\Delta }\) first using a method-of-moments approach as follows. First, ignoring the measurement errors, compute the naïve LAD estimator \(\hat{\mathcal {S}}_{n}\) on the observed data \((y_i, {\textbf{W}}_i^{\text {T}})\), \(i=1,\ldots ,n\), and let \(\hat{\varvec{\Psi }}_{\text {n}}\) denote a corresponding semi-orthogonal basis of \(\hat{\mathcal {S}}_{n}\). Then an estimate of \(\text {E}\left\{ \text {Var}(\textbf{W}\vert y) \right\} \) is given by \(\hat{\varvec{\Delta }}_{\text {n}} = \{\hat{\varvec{\Psi }}_{\text {n}}(\hat{\varvec{\Psi }}_{\text {n}}^{\text {T}} \tilde{\varvec{\Delta }}\hat{\varvec{\Psi }}_{\text {n}})^{-1} \hat{\varvec{\Psi }}_{\text {n}}^{\text {T}} + \tilde{\varvec{\Sigma }}^{-1} - \hat{\varvec{\Psi }}_{\text {n}}(\hat{\varvec{\Psi }}_{\text {n}}^{\text {T}} \tilde{\varvec{\Sigma }}\hat{\varvec{\Psi }}_{\text {n}})^{-1} \hat{\varvec{\Psi }}_{\text {n}}^{\text {T}} \}^{-1}\), and we estimate \(\varvec{\Delta }\) by \(\hat{\varvec{\Delta }} = \hat{\varvec{\Delta }}_{\text {n}} - \varvec{\Sigma }_u\). Replacing \(\varvec{\Delta }\) with \(\hat{\varvec{\Delta }}\) in (3) and removing terms that do not depend on \(\varvec{\Psi }\), we then maximize

$$\begin{aligned} \ell _2(\varvec{\Psi })&= n \log \vert \varvec{\Psi }^{\text {T}} (\hat{\textbf{L}} \tilde{\varvec{\Sigma }}_{w} \hat{\textbf{L}}^{\text {T}}) \varvec{\Psi }\vert - \sum _{m=1}^{M} n_m \log \vert \varvec{\Psi }^{\text {T}} \hat{\textbf{L}} \tilde{\varvec{\Delta }}_{wm} \hat{\textbf{L}}^{\text {T}} \varvec{\Psi }\vert , \end{aligned}$$
(4)

where \( \hat{\textbf{L}} = \hat{\varvec{\Delta }}(\hat{\varvec{\Delta }}+ \varvec{\Sigma }_u)^{-1}\). The right hand side of (4) is invariant to rotation of \(\varvec{\Psi }\) by an orthogonal matrix \(\textbf{A}\in \mathbb {R}^{d \times d}\); that is, for any matrix \(\textbf{A}\) with \(\textbf{A}^{\text {T}} = \textbf{A}^{-1}\), we have \(\ell _2(\varvec{\Psi }\textbf{A}) = \ell _2 (\varvec{\Psi })\). Therefore, the function in (4) is actually a function of the column space of \(\varvec{\Psi }\), which is characterized by the projection matrix \(P_{\mathcal {S}} = \varvec{\Psi }\varvec{\Psi }^{\text {T}}\). Similar to LAD then, maximization of (4) is performed over the Grassmann manifold \(\mathcal {G}_{d, p}\), the collection of all d-dimensional linear subspaces \(\mathcal {S}\) of \(\mathbb {R}^{p}\); each such subspace is spanned by some \(p \times d\) semi-orthogonal basis matrix. Specifically, the function in (4) can be written as

$$\begin{aligned} \ell _2(\mathcal {S}) = \log \Vert P_{\mathcal {S}} (\hat{\textbf{L}} \tilde{\varvec{\Sigma }}_w \hat{\textbf{L}}^{\text {T}}) P_{\mathcal {S}} \Vert _0 - \sum _{m=1}^{M} f_m \log \Vert P_{\mathcal {S}} \hat{\textbf{L}} \tilde{\varvec{\Delta }}_{wm} \hat{\textbf{L}}^{\text {T}} P_{\mathcal {S}} \Vert _0, \end{aligned}$$
(5)

where for any square matrix \(\textbf{A}\), we use \(\Vert \textbf{A}\Vert _0\) to denote the product of its non-zero eigenvalues, and \(f_m = n_m/n\) denotes the proportion of observations in the mth slice. The corresponding Euclidean gradient of this objective function at a semi-orthogonal basis \(\varvec{\Psi }\) of \(\mathcal {S}\) is

$$\begin{aligned} \nabla \ell _2(\varvec{\Psi })&= (\hat{\textbf{L}} \tilde{\varvec{\Sigma }}_w \hat{\textbf{L}}^{\text {T}}) \varvec{\Psi }\left( \varvec{\Psi }^{\text {T}} \hat{\textbf{L}} \tilde{\varvec{\Sigma }}_w \hat{\textbf{L}}^{\text {T}} \varvec{\Psi }\right) ^{-1} \\&\quad - \sum _{m=1}^{M} f_m (\hat{\textbf{L}} \tilde{\varvec{\Delta }}_{wm} \hat{\textbf{L}}^{\text {T}}) \varvec{\Psi }\left( \varvec{\Psi }^{\text {T}} \hat{\textbf{L}} \tilde{\varvec{\Delta }}_{wm} \hat{\textbf{L}}^{\text {T}} \varvec{\Psi }\right) ^{-1}, \end{aligned}$$

and the corresponding Riemann gradient of \(\ell _2(\mathcal {S})\) is given by \({\text {grad}}\{{\ell }_2(\mathcal {S})\} = (\textbf{I}_p - P_{\mathcal {S}}) \nabla {\ell }_2(\varvec{\Psi })\). Note the Euclidean gradient \(\nabla \ell _2(\varvec{\Psi })\) is derived by treating \(\varvec{\Psi }\) as a collection of (pd) unstructured elements, and the Riemann gradient is the projection of this Euclidean gradient onto the tangent space of \(\mathcal {G}_{d, p}\) at the subspace \(\mathcal {S}\) spanned by \(\varvec{\Psi }\).

This closed-form Riemann gradient facilitates the use of the trust region algorithm on the Grassmann manifold developed by Absil et al. (2007). Starting from an initial solution \(\hat{\mathcal {S}}^{(0)}\), the algorithm finds the next candidate \(\hat{\mathcal {S}}^{(1)}\) by first minimizing a model of the objective function within a constrained region of the tangent space at \(\hat{\mathcal {S}}^{(0)}\), and then mapping this solution back to \(\mathcal {G}_{d, p}\). The constraint imposed on the tangent space is known as a trust region. The decision of whether to accept the candidate and whether to expand the region is based on a quotient; different values of the quotient lead to one of three possibilities: (i) accept the candidate and expand the trust region, (ii) accept the candidate and reduce the trust region, or (iii) reject the candidate and reduce the trust region. This procedure is carried out until convergence; see Gallivan et al. (2003) and Absil et al. (2007) for further details. We use the Manopt package (Boumal et al. 2014) in Matlab, a dedicated toolbox for optimization on manifolds and matrices, to maximize (5). In this implementation, there is no need to provide the Hessian of \(\ell _2(\mathcal {S})\), since Manopt automatically incorporates a numerical approximation to this matrix based on finite differences in its trust-region solver.
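
The following sketch puts the pieces together under our reading of the procedure above: it computes the slice moments, forms the plug-in \(\hat{\varvec{\Delta }}\) and \(\hat{\textbf{L}}\) (taking \(\tilde{\varvec{\Delta }}\) and \(\tilde{\varvec{\Sigma }}\) to be the pooled within-slice and marginal sample covariances of \(\textbf{W}\), our assumption for illustration), and maximizes (5) with Manopt's trust-region solver. Here slice_assign and surrogate_moments are the hypothetical helpers sketched earlier, and clad_fit and grassmann_lad are illustrative names, not the released implementation.

```matlab
% Minimal sketch of the cLAD fit, assuming the Manopt toolbox is on the path.
function Psi_clad = clad_fit(y, W, Sigma_u, M, d)
p = size(W, 2);
[slice_id, f] = slice_assign(y, M);                  % earlier sketch
[SigW, DelW] = surrogate_moments(W, slice_id, M);    % earlier sketch

% Naive LAD on (y, W): same objective with L equal to the identity.
Psi_naive = grassmann_lad(eye(p), SigW, DelW, f, d);

% Moment-based plug-in for Delta and L (taking the tilde quantities to be the
% pooled within-slice and marginal sample covariances of W).
DelW_pooled = zeros(p);
for m = 1:M, DelW_pooled = DelW_pooled + f(m) * DelW{m}; end
Delta_n = inv( Psi_naive * ((Psi_naive' * DelW_pooled * Psi_naive) \ Psi_naive') ...
             + inv(SigW) ...
             - Psi_naive * ((Psi_naive' * SigW * Psi_naive) \ Psi_naive') );
Delta_hat = Delta_n - Sigma_u;
Lhat = Delta_hat / (Delta_hat + Sigma_u);            % Delta_hat * inv(Delta_hat + Sigma_u)

Psi_clad = grassmann_lad(Lhat, SigW, DelW, f, d);    % corrected LAD estimate
end

function Psi_hat = grassmann_lad(L, Sig, Del, f, d)
% Maximize the objective in (5) over the Grassmann manifold with Manopt.
p = size(L, 1);
A0 = L * Sig * L';
Am = cell(1, numel(Del));
for m = 1:numel(Del), Am{m} = L * Del{m} * L'; end
problem.M = grassmannfactory(p, d);                  % points: p-by-d orthonormal bases
problem.cost  = @(Psi) -obj(Psi, A0, Am, f);         % Manopt minimizes, so negate
problem.egrad = @(Psi) -egrad_obj(Psi, A0, Am, f);   % converted to the Riemann gradient by Manopt
% No Hessian is supplied; trustregions() uses a finite-difference approximation.
Psi_hat = trustregions(problem);
end

function val = obj(Psi, A0, Am, f)
val = log(det(Psi' * A0 * Psi));
for m = 1:numel(Am), val = val - f(m) * log(det(Psi' * Am{m} * Psi)); end
end

function G = egrad_obj(Psi, A0, Am, f)
% Euclidean gradient of obj (proportional to the expression displayed above).
G = 2 * A0 * Psi / (Psi' * A0 * Psi);
for m = 1:numel(Am)
    G = G - 2 * f(m) * (Am{m} * Psi / (Psi' * Am{m} * Psi));
end
end
```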

For the remainder of this article, we refer to the solution of this problem, and hence our proposed estimator, as the corrected LAD estimator (cLAD) of the central subspace.

3.2 Invariance-law LAD estimator

The cLAD estimator is constructed by maximizing a likelihood function involving the adjusted surrogate \(\textbf{V}_i = \textbf{L}\textbf{W}_i = {\varvec{\Delta }} ({\varvec{\Delta }} + {\varvec{\Sigma }}_u)^{-1} \textbf{W}_i\). In this section, we consider a likelihood-based estimator for the central subspace based instead on maximizing a likelihood function involving \(\textbf{X}^* = {\varvec{\Sigma }}_x{\varvec{\Sigma }}_w^{-1}\textbf{W}\), the adjusted surrogate introduced in the invariance law of Li and Yin (2007). From the observed data, a sample of the adjusted surrogate \(\textbf{X}^*\) can be constructed as \(\widehat{\textbf{X}}_i^* = \hat{\varvec{\Sigma }}_x \hat{{\varvec{\Sigma }}}_w^{-1}\textbf{W}_i\), where \(\hat{{\varvec{\Sigma }}}_{w} = n^{-1}\sum _{i=1}^{n}\textbf{W}_i \textbf{W}_i^{\text {T}}\) and \(\hat{{\varvec{\Sigma }}}_x = \hat{{\varvec{\Sigma }}}_{w} -\varvec{\Sigma }_u\). The equivalence between \(\mathcal {S}_{y\vert \textbf{X}}\) and \(\mathcal {S}_{y\vert \textbf{X}^*}\) motivates applying LAD to \((y_i, \hat{\textbf{X}}_i^*)\), to obtain the invariance-law LAD (IL-LAD) estimator \(\hat{\varvec{\Psi }}^*\), which maximizes the objective function

$$\begin{aligned} \ell _3(\mathcal {S}) = \log \Vert P_{\mathcal {S}} \tilde{\varvec{\Sigma }}^* P_{\mathcal {S}} \Vert _0 - \sum _{m=1}^{M} f_m \log \Vert P_{\mathcal {S}} \tilde{\varvec{\Delta }}_m^* P_{\mathcal {S}} \Vert _0, \end{aligned}$$

where \(\tilde{\varvec{\Sigma }}^*\) and \(\tilde{\varvec{\Delta }}_m^*\) denote the sample marginal covariance of \(\textbf{X}^*\) and the sample covariance of \(\textbf{X}^*\) within the mth slice, respectively.

Similar to the cLAD estimator, maximization of \(\ell _3(\mathcal {S})\) is also performed over the Grassmann manifold \(\mathcal {G}_{d, p}\). The main difference between the estimators is that the cLAD estimator uses the conditional covariance \({\varvec{\Delta }} = \text {E}\left\{ \text {Var}(\textbf{X}\mid y)\right\} \), whereas the IL-LAD estimator uses the marginal covariance \({\varvec{\Sigma }}_x\), to construct the adjusted surrogate. Nevertheless, in the following subsection, we prove that the IL-LAD and cLAD estimators are asymptotically equivalent. In finite samples, we found that the difference between the two estimators is often negligible.
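
In terms of computation, only the inputs change: the same Grassmann routine can be applied to the sample moments of the adjusted surrogates \(\hat{\textbf{X}}^*_i\). A hedged sketch, reusing the hypothetical helpers slice_assign, surrogate_moments and grassmann_lad from the earlier sketches (with Xstar, y, M and d as placeholder inputs):

```matlab
% Minimal sketch: IL-LAD reuses the Grassmann solver with the adjusted
% surrogates Xstar (rows = Xstar_i'), computed as in the Introduction sketch.
p = size(Xstar, 2);
[slice_id, f] = slice_assign(y, M);
[Sig_star, Del_star] = surrogate_moments(Xstar, slice_id, M);
Psi_il = grassmann_lad(eye(p), Sig_star, Del_star, f, d);   % maximizes ell_3
```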

3.3 Consistency

In this section, we establish the consistency of both the cLAD and IL-LAD estimators, and show them to be asymptotically equivalent. We focus on the setting where d is known and p is fixed, similar to Cook and Forzani (2009). Also, while we assume the measurement error covariance matrix \({\varvec{\Sigma }}_u\) to be known, we highlight that the theoretical results in this section continue to hold if \({\varvec{\Sigma }}_u\) is replaced by a consistent estimator \(\hat{{\varvec{\Sigma }}}_u\). An example of the latter is given in our data application in Sect. 6, where replicates are available for the n observations. When the true covariates \(\textbf{X}\) are observed, Cook and Forzani (2009) proved the consistency of the LAD estimator by establishing the equivalence between the population subspace spanned by LAD and that spanned by the sliced average variance estimator (SAVE), \(\mathcal {S}_{\text {SAVE}} = {\varvec{\Sigma }}_x^{-1} \text {span} \left( {\varvec{\Sigma }}_x - {\varvec{\Delta }}_1, \ldots , {\varvec{\Sigma }}_x - {\varvec{\Delta }}_M \right) \). We use a similar argument here and prove that, as \(n \rightarrow \infty \), the subspace spanned by either the cLAD or IL-LAD estimator is equivalent to that spanned by the SAVE estimator computed with the true covariates \(\textbf{X}\).

For any subspace \(\mathcal {S}\), the function \(\ell _2(\mathcal {S})\) in (5) converges in probability to \( K_2(\mathcal {S}) = \log \Vert P_{\mathcal {S}} (\textbf{L}(\varvec{\Sigma }_x + \varvec{\Sigma }_u) \textbf{L}^{\text {T}}) P_{\mathcal {S}} \Vert _0 - \sum _{m=1}^{M} f_m \log \Vert P_{\mathcal {S}} {\textbf{L}} (\varvec{\Delta }_y^{(m)} + \varvec{\Sigma }_u) \textbf{L}^{\text {T}} P_{\mathcal {S}} \Vert _0\), where \( f_m = n_m/n. \) The population cLAD subspace is then defined to be \(\mathcal {S}^*_{\text {cLAD}} = \arg \max _{\mathcal {S}} K_2(\mathcal {S})\). Similarly, the subspace spanned by the IL-LAD estimator converges to the population IL-LAD subspace \(\mathcal {S}^*_{\text {IL-LAD}}\) that maximizes \( K_3(\mathcal {S}) = \log \Vert P_{\mathcal {S}} \varvec{\Sigma }^* P_{\mathcal {S}} \Vert _0 - \sum _{m=1}^{M} f_m \log \Vert P_{\mathcal {S}} \varvec{\Delta }_m^* P_{\mathcal {S}} \Vert _0,\) with \({\varvec{\Sigma }}^* = \text {Var}(\textbf{X}^*) = {\varvec{\Sigma }}_x{\varvec{\Sigma }}_w^{-1}{\varvec{\Sigma }}_x\), and \({\varvec{\Delta }}_m^* = \text {Var}\left( \textbf{X}_i^{(m)*} \mid y_i^{(m)} \right) = {\varvec{\Sigma }}_x {\varvec{\Sigma }}_{w}^{-1}\left( {\varvec{\Delta }}_y^{(m)}+ {\varvec{\Sigma }}_u\right) {\varvec{\Sigma }}_{w}^{-1}{\varvec{\Sigma }}_x\). The main result below establishes the equivalence between the population cLAD and IL-LAD subspaces and the population SAVE subspace.

Theorem 2

\(\mathcal {S}^*_{\textrm{cLAD}} = \mathcal {S}^*_{\mathrm {IL-LAD}} = \mathcal {S}_{\textrm{SAVE}}\).

As noted in Sect. 3.2, the main difference between the cLAD and IL-LAD estimators is the use of \({\varvec{\Delta }}\) versus \({\varvec{\Sigma }}_x\) in the construction of the adjusted surrogate. Nevertheless, asymptotically both are equivalent to the SAVE estimator based on the true covariates, since the orthogonal complement \({\varvec{\Phi }}_0\) corresponding to the SAVE estimator satisfies \({\varvec{\Phi }}_0^{\text {T}} {\varvec{\Delta }}^{-1} = {\varvec{\Phi }}_0^{\text {T}} {\varvec{\Sigma }}_x^{-1}\). Similar to Proposition 2 in Cook and Forzani (2009), Theorem 2 does not require any distributional assumptions on the model, and relies only on the properties of positive definite matrices and the concavity of the log-determinant function. The result implies that the cLAD estimator is consistent whenever the true LAD and SAVE estimators (i.e., those computed assuming \(\textbf{X}\) were observed) are consistent for the central subspace \(\mathcal {S}_{y \vert \textbf{X}}\). As shown in Li and Wang (2007) and Cook and Forzani (2009), these two estimators are consistent under a linearity and constant covariance condition.

4 Sparse surrogate dimension reduction

When the number of covariates is large, it is typically assumed that the sufficient predictors \(\textbf{B}^{\text {T}} \textbf{X}\) depend on only a few covariates (e.g., Qian et al. 2019). In other words, the true matrix \(\textbf{B}\) is row-sparse. Since only the column space of \(\textbf{B}\) is identifiable from the samples, we translate the sparsity of \(\textbf{B}\) into the element-wise sparsity of the projection matrix, and impose a corresponding penalty to achieve sparse solutions. Below, we formalize this idea with the proposed cLAD estimator, although an analogous procedure can be applied to the IL-LAD estimator.

We propose to maximize the regularized objective function

$$\begin{aligned} \tilde{\ell }_2(\mathcal {S})&= \ell _2(\mathcal {S}) - \lambda \Vert P_{\mathcal {S}} \Vert _1 \\&= \log \Vert P_{\mathcal {S}} (\hat{\textbf{L}} \tilde{\varvec{\Sigma }}_w \hat{\textbf{L}}^{\text {T}}) P_{\mathcal {S}} \Vert _0 - \sum _{m=1}^{M} f_m \log \Vert P_{\mathcal {S}} \hat{\textbf{L}} \tilde{\varvec{\Delta }}_{wm} \hat{\textbf{L}}^{\text {T}} P_{\mathcal {S}} \Vert _0 - \lambda \Vert P_{\mathcal {S}} \Vert _1, \end{aligned}$$
(6)

where for any matrix \(\textbf{A}\) with elements \(a_{ij}\), we denote \(\Vert \textbf{A}\Vert _1 = \sum _{i,j} \vert a_{ij} \vert \), and \(\lambda > 0\) is a tuning parameter. Although it is possible to use other regularization functions to achieve sparse solutions, here we follow Wang et al. (2017) and choose the \(\ell _1\) norm regularizer to ease the optimization problem on the Grassmann manifold, as both the Euclidean and Riemann gradient of the objective function \(\tilde{\ell }_2(\mathcal {S})\) have a closed form.

In more detail, for any matrix \(\textbf{A}\) of arbitrary dimension \(m \times n\), let \(\text {vec}(\textbf{A})\) denote the mn-dimensional vector formed by stacking the columns of \(\textbf{A}\), and let \(\text {ivec}(\cdot )\) denote the inverse vectorization operator, i.e., \({\text {ivec}}\{{\text {vec}}(\textbf{A})\} = \textbf{A}\). Let \(\textbf{T}_{m,n} \) be the (unique) matrix of dimension \(mn \times mn\) satisfying \(\text {vec}(\textbf{A}) = \textbf{T}_{m, n} \text {vec}(\textbf{A}^{\text {T}})\). Taking the Euclidean gradient of the regularization term with respect to \(\varvec{\Psi }\), we obtain

$$\begin{aligned} {\text {vec}}\left\{ \frac{\partial \left\| \varvec{\Psi }\varvec{\Psi }^{\text {T}}\right\| _{1}}{\partial \varvec{\Psi }}\right\} ^{\text {T}}&={\text {vec}}\left\{ {\text {sgn}}\left( {\varvec{\Psi }}{\varvec{\Psi }}^{\text {T}}\right) \right\} ^{\text {T}} \frac{\partial \varvec{\Psi }\varvec{\Psi }^{\text {T}}}{\partial \varvec{\Psi }}, \\ \frac{\partial \varvec{\Psi }\varvec{\Psi }^{\text {T}}}{\partial \varvec{\Psi }}&= \left( \textbf{I}_{p^2} + \textbf{T}_{p, p}\right) \left( \varvec{\Psi }\otimes \textbf{I}_{p}\right) , \end{aligned}$$

where \(\textbf{I}_{p^2}\) is the identity matrix of dimension \(p^2 \times p^2\) and \(\otimes \) denotes the Kronecker product. As a result, the Euclidean gradient of \(\tilde{\ell }_2(\mathcal {S})\) at any semi-orthogonal matrix \(\varvec{\Psi }\) is \( \nabla \tilde{\ell }_2(\varvec{\Psi }) = \nabla \ell _2(\varvec{\Psi }) - \lambda \,\text {ivec} \left[ \text {vec}\left\{ \partial \left\| \varvec{\Psi }\varvec{\Psi }^{\text {T}}\right\| _{1}/\partial \varvec{\Psi }\right\} \right] , \) and the corresponding Riemann gradient is \( {\text {grad}}\tilde{\ell }_2(\mathcal {S}) = (\textbf{I}_p - \varvec{\Psi }\varvec{\Psi }^{\text {T}}) \nabla \tilde{\ell }_2(\varvec{\Psi }). \) These Euclidean and Riemann gradients are used in the same trust region algorithm of Absil et al. (2007) to maximize \(\tilde{\ell }_2(\mathcal {S})\). We computed the maximizer of (6) on a grid of 40 logarithmically equally-spaced values of the tuning parameter \(\lambda \) between 0 and \(\lambda _{\text {max}}\), where \(\lambda _{\text {max}}\) is the value at which the maximizer \(P_{\hat{\mathcal {S}}}\) of (6) is close to the identity matrix. In our experiments, we found that setting \(\lambda _{\max }\) to 1 leads to good performance and that the algorithm converges quickly to a stable solution at each value of the tuning parameter \(\lambda \) in the grid.
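
Because \(\varvec{\Psi }\varvec{\Psi }^{\text {T}}\) is symmetric, the vec/Kronecker expression above collapses to the \(p \times d\) matrix \(2\,{\text {sgn}}(\varvec{\Psi }\varvec{\Psi }^{\text {T}})\varvec{\Psi }\), which is what the sketch below uses; pen_egrad is an illustrative helper name, and egrad_l2 stands for the unpenalized Euclidean gradient (e.g., egrad_obj from the earlier sketch).

```matlab
% Minimal sketch: Euclidean and Riemann gradients of the penalized objective (6)
% at a semi-orthogonal Psi, given the unpenalized Euclidean gradient egrad_l2
% (a p x d matrix) and the tuning parameter lambda.
function [egrad, rgrad] = pen_egrad(Psi, egrad_l2, lambda)
P = Psi * Psi';                                  % projection matrix P_S
egrad = egrad_l2 - lambda * 2 * sign(P) * Psi;   % d||P||_1 / dPsi = 2 * sgn(P) * Psi
rgrad = (eye(size(P)) - P) * egrad;              % projection onto the tangent space
end
```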

Finally, to select the optimal tuning parameter \(\lambda \), we use a variant of the projection information criterion (PIC) developed by Nghiem et al. (2022). Let \(\hat{\mathcal {S}}_0\) be an initial consistent, non-sparse estimator of the central subspace, and let \(\hat{\mathcal {S}}(\lambda )\) be the maximizer of (6) associated with \(\lambda \). We propose to select \(\lambda \) by minimizing \( \textsc {PIC}(\lambda ) = \Vert P_{\hat{\mathcal {S}}(\lambda )} - P_{\hat{\mathcal {S}}_0}\Vert _F^2 + p^{-1}\log (p) {\text {df}}(P_{\hat{\mathcal {S}}(\lambda )}), \) where \({\text {df}}(P_{\hat{\mathcal {S}}(\lambda )}) = d(s_{\lambda }-d)\) and \(s_{\lambda }\) is the number of non-zero diagonal elements of \(P_{\hat{\mathcal {S}}(\lambda )}\). The first term in \(\textsc {PIC}(\lambda )\) is a measure of goodness-of-fit, while the second term is a measure of complexity. Compared to the information criterion developed in Nghiem et al. (2022), this modified criterion has a different complexity term so as to more accurately quantify the number of free parameters of the corresponding Grassmann manifold. In our numerical studies, we choose the initial consistent estimator \(\hat{\mathcal {S}}_0\) to be the estimated central subspace corresponding to the unregularized cLAD estimator, and find that this choice usually leads to good variable selection performance. For the remainder of this article, we refer to the maximizer of (6), with the tuning parameter selected via this projection criterion, as the sparse corrected LAD (scLAD) estimator.
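
A sketch of the PIC search over the \(\lambda \) grid is given below; lambda_grid, the fitted projection matrices P_lambda, and the unregularized reference P0 are placeholder inputs, not part of the released code.

```matlab
% Minimal sketch: select lambda by minimizing the modified PIC.
% P_lambda is a cell array of maximizers of (6) over the grid lambda_grid;
% P0 is the projection matrix of the unregularized cLAD estimate.
K = numel(lambda_grid);
pic = zeros(1, K);
for k = 1:K
    Pk = P_lambda{k};
    s_k = sum(abs(diag(Pk)) > 1e-8);            % number of non-zero diagonal entries
    pic(k) = norm(Pk - P0, 'fro')^2 + (log(p) / p) * d * (s_k - d);
end
[~, k_best] = min(pic);
lambda_hat = lambda_grid(k_best);               % selected tuning parameter
```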

5 Simulation studies

We conduct a numerical study to examine the performance of the proposed cLAD and IL-LAD estimators in finite samples. We generate the true predictors and outcome from the following four single/multiple index models: (i) \(y_i = 0.5(\textbf{X}_i^{\text {T}} \varvec{\beta }_1)^3 + 0.25 \vert \textbf{X}_i^{\text {T}} \varvec{\beta }_1\vert \varepsilon _i\), (ii) \(y_i = 3(\textbf{X}_i^{\text {T}} \varvec{\beta }_1)/(1+\textbf{X}_i^{\text {T}} \varvec{\beta }_1)^2 + 0.25 \varepsilon _i\), (iii) \(y_i = 4\sin (\textbf{X}_i^{\text {T}} \varvec{\beta }_2/4) + 0.5(\textbf{X}_i^{\text {T}} \varvec{\beta }_1)^2 + 0.25\varepsilon _i\), and (iv) \(y_i = 3(\textbf{X}_i^{\text {T}} \varvec{\beta }_1)\exp (\textbf{X}_i^{\text {T}} \varvec{\beta }_2 + 0.25 \varepsilon _i) \), with \(\varepsilon _i \sim N(0,1)\), and \(\textbf{W}_i = \textbf{X}_i + \textbf{U}_i\) for \(i=1,\ldots , n\). The single index models (i) and (ii) are similar to those considered in Lin et al. (2019) and Nghiem et al. (2022), while the multiple index models (iii) and (iv) are similar to those considered in Reich et al. (2011). The true central subspace in models (i) and (ii) is the subspace spanned by \(\textbf{B}= \varvec{\beta }_1\), while in models (iii) and (iv) it is the column space of \(\textbf{B}= [\varvec{\beta }_1, \varvec{\beta }_2]\). Moreover, we set \(\varvec{\beta }_1 = (1, 1, 1, 0, \ldots , 0)^{\text {T}}\) and \(\varvec{\beta }_2 = (0, 0, 1, 1, 1, 0, \ldots , 0)^{\text {T}}\), such that the true central subspace for the single index models depends only on the first three covariates, while for the multiple index models it depends only on the first five covariates across two indices.

Next, we generate the true predictors \(\textbf{X}_i\) from one of the following three choices: (1) a p-variate Gaussian distribution \(N(\varvec{0}, \varvec{\Sigma }_x)\), (2) a p-variate t distribution with three degrees of freedom and the same covariance matrix \(\varvec{\Sigma }_x\), and (3) a p-variate half-Gaussian distribution \(\vert N(\varvec{0}, \varvec{\Sigma }_x) \vert \), where for all three choices we set the covariance matrix to have an autoregressive structure \(\sigma _{xij} = 0.5^{\vert i-j \vert }\). Turning to the measurement error, we generate \(\textbf{U}_i\) from a multivariate Gaussian distribution \(N(\varvec{0}, \varvec{\Sigma }_u)\), where \(\varvec{\Sigma }_u\) is set to a diagonal matrix with elements drawn from a uniform U(0.2, 0.5) distribution. The sample size n is set to either 1000 or 2000, while the number of covariates p is set to either 20 or 40. We assume that \(\varvec{\Sigma }_u\) and the structural dimension d are known. For each simulation configuration, we generated 200 samples.
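
As an illustration of this data-generating process, the sketch below simulates one dataset under single-index model (i) with Gaussian covariates; all settings follow the description above, and the variable names are placeholders.

```matlab
% Minimal sketch: one simulated dataset under model (i) with Gaussian X.
n = 1000; p = 20;
beta1 = [ones(3, 1); zeros(p - 3, 1)];             % true direction
Sigma_x = 0.5 .^ abs((1:p)' - (1:p));              % AR(1) covariance, sigma_xij = 0.5^|i-j|
X = randn(n, p) * chol(Sigma_x);                   % X ~ N(0, Sigma_x)
Sigma_u = diag(0.2 + 0.3 * rand(p, 1));            % diagonal error covariance, U(0.2, 0.5)
W = X + randn(n, p) * sqrt(Sigma_u);               % surrogates W = X + U
xb = X * beta1;
y = 0.5 * xb.^3 + 0.25 * abs(xb) .* randn(n, 1);   % model (i)
```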

We point out that the data generation process used for the simulation study is often referred to as a forward index model, because we first generate the covariate \(\textbf{X}\) from a marginal distribution, and then generate the response y from a conditional distribution, i.e., from \(y\mid \textbf{X}\), which is the same as \(y \mid \textbf{B}^{\text {T}} \textbf{X}\). More importantly, we highlight that the data generation processes are not the same as those imposed by the Gaussian inverse regression models. In particular, with these simulation configurations, the conditional distribution \(\textbf{X}_i \mid y_i\) is generally not Gaussian due to the presence of non-linear link functions. As such, the conditional Gaussianity only plays the role of a “working model.” This type of data generation process is also used in Cook and Forzani (2009).

For each simulated dataset, we compute the two unregularized likelihood-based estimates, cLAD and IL-LAD, and compare them with several invariance-law inverse-moment-based estimates, namely sliced inverse regression (IL-SIR), sliced average variance estimation (IL-SAVE), and directional regression (IL-DR). To account for the sparsity of the central subspace, we compute the scLAD estimator and compare it with two invariance-law sparse estimates, namely the Lasso SIR of Lin et al. (2019) (IL-Lin) and the sparse SIR proposed by Tan et al. (2018) (IL-Tan). Both of these estimators are sparse versions of sliced inverse regression: the first imposes sparsity on each sufficient direction, while the second imposes sparsity on the projection matrix and induces row-sparsity. For the estimator proposed by Lin et al. (2019), we use the R package LassoSIR with its default settings. For the estimator proposed by Tan et al. (2018), we use the code provided by the authors. For both estimators, we follow the procedure recommended by the authors and choose the tuning parameter based on ten-fold cross-validation.

We assess the performance of all estimators based on the Frobenius norm of the difference between the projection matrix associated with the true central subspace and that associated with the estimated central subspace. For the sparse estimators, we assess variable selection performance based on the following metric. Let \(\textbf{P}\) denote the projection matrix associated with the true directions, i.e., \(\textbf{P} = \textbf{B}\left( \textbf{B}^{\text {T}} \textbf{B}\right) ^{-1}\textbf{B}^{\text {T}}\), and let \(\hat{\textbf{P}}\) be an estimator of \(\textbf{P}\). We then define the number of true positives (TP) as the number of non-zero diagonal elements of \(\textbf{P}\) that are correctly identified as non-zero, the number of false negatives (FN) as the number of non-zero diagonal elements that are incorrectly identified as zero, and the number of false positives (FP) as the number of zero diagonal elements that are incorrectly identified as non-zero. We also compute the F1 score as 2TP/(2TP + FP + FN). This metric ranges from zero to one, with a value of one indicating perfect variable selection.
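
The sketch below computes these metrics from the diagonals of the true and estimated projection matrices; P_true and P_est are placeholder inputs for \(\textbf{P}\) and \(\hat{\textbf{P}}\).

```matlab
% Minimal sketch: projection error and variable-selection metrics.
tol = 1e-8;
active   = abs(diag(P_true)) > tol;        % covariates in the true central subspace
selected = abs(diag(P_est))  > tol;        % covariates selected by the estimator
TP = sum(active & selected);
FP = sum(~active & selected);
FN = sum(active & ~selected);
F1 = 2 * TP / (2 * TP + FP + FN);
proj_err = norm(P_est - P_true, 'fro');    % estimation error reported in the tables
```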

For brevity, we present the projection error and F1 variable selection results for \(p=40\) below; the results for \(p=20\), along with the FP and FN selection results, offer overall similar conclusions and are deferred to the Supplementary Materials. Table 1 demonstrates that, among the unregularized estimators, the likelihood-based estimators cLAD and IL-LAD have smaller estimation error in the majority of settings compared to the other, invariance-law inverse-moment-based estimators. This result is consistent with Cook and Forzani (2009), who demonstrated that LAD in general tends to exhibit superior performance relative to other inverse-moment-based estimators such as SIR and SAVE when no measurement error is present. The performance of all the considered estimators tends to deteriorate when the true covariates \(\textbf{X}\) deviate from Gaussianity, such as when they are skewed or have heavier tails. There is negligible difference in performance between the cLAD and IL-LAD estimators.

Next, the left half of Table 2 demonstrates the estimation performance of the sparse SDR estimators. In terms of projection error, and analogous to the results for the unregularized methods above, the proposed scLAD estimator tends to have lower estimation error than the IL-Lin and IL-Tan estimators. The improvement is most pronounced in the two multiple index models, i.e., Model (iii) and Model (iv), and when the true covariate vector \(\textbf{X}\) follows a Gaussian distribution. Again similar to the unregularized estimators, the performance of these sparse estimators deteriorates when the true covariates \(\textbf{X}\) deviate from Gaussianity, especially when they follow a heavy-tailed distribution such as the \(t_3\). As an aside, we note that the results presented in Table 2 for the scLAD estimator reflect the contributions of both the likelihood and the sparsity penalty to estimation performance. The results for the contribution of the likelihood function alone, i.e., the unregularized cLAD estimator, were presented in Table 1, and for reasons of brevity we chose to avoid explicitly repeating these results in Table 2.

Table 1 Mean projection errors of unregularized estimators in simulation settings with \(p=40\)
Table 2 Performance of the sparse estimators based on average projection error and F1 score for simulation settings with \(p=40\)

Turning to the variable selection results, the scLAD estimator had the best overall selection performance among the three considered estimators. When the true covariate vector \(\textbf{X}\) follows a Gaussian distribution, the scLAD estimator selects the true set of important covariates across all considered settings, as reflected in the corresponding F1 scores all being exactly equal to one.

This conclusion still holds when the true covariates follow a half-Gaussian distribution and under the single index models (i) and (ii). However, for the other settings such as the multiple index models (iii) and (iv), scLAD tends to have lower F1 scores than IL-Tan for \(n=1000\), although the trend is reversed when the sample size increases to \(n=2000\). By contrast, the IL-Lin estimator consistently has a very low F1 score, with additional results in the Supplementary Materials demonstrating that this estimator incurs too many false positives. This reflects the advantage of imposing regularization directly on the diagonal elements of the corresponding projection matrix, compared to doing so on each dimension separately. However, we acknowledge that the variable selection results do depend on how the tuning parameter is chosen, and we are not aware of any method that is guaranteed to achieve selection consistency when the central subspace must be estimated from surrogates.

6 National health and nutrition examination survey data

We apply the proposed methodology to analyze a dataset from the National Health and Nutrition Examination Survey (NHANES) (Centers for Disease Control and Prevention 2022). This survey aims to assess the health and nutritional status of people in the United States and to track the evolution of this status over time. During the 2009-2010 survey period, participants were interviewed and asked to provide their demographic background as well as information about nutrition habits, and to undertake a series of health examinations. To assess the nutritional habits of participants, dietary data were collected using two 24-hour recall interviews, wherein the participants self-reported the consumed amounts of a set of food items during the 24 hours prior to each interview. Based on these recalls, daily aggregated consumption of water, food energy, and other nutrition components such as total fat and total sugar was computed.

In this application, we focus on the relationship between participants’ total cholesterol level (y), their age (Z), and their daily intakes of 42 nutrition components, such as sugars, total vitamins, fats, retinol, lycopene, zinc, and selenium, among many others. We restrict our analysis to \(n = 3343\) women, and assume participants’ ages are measured accurately while the daily intakes of nutrition components are subject to additive measurement errors. For the ith participant, let \(\textbf{W}_{i1}\) and \(\textbf{W}_{i2}\) denote the \(43 \times 1\) vectors of surrogates at the first and second interview, respectively. For each vector, the first element is the age (\(Z_i\)) and the remaining elements are the recorded values for the nutrition components at the corresponding interview time. We assume the classical measurement error model \(\textbf{W}_{ij} = \textbf{X}_i + \textbf{U}_{ij}\) for \(i=1,\ldots ,n\) and \(j=1,2\), where \(\textbf{X}_i\) denotes the vector of long-term nutrition intakes, and \(\textbf{U}_{ij} \sim N(\varvec{0}, {\varvec{\Sigma }}^*_u)\) denotes the measurement errors. Moreover, as \(\text {E}\left\{ (\textbf{W}_{i1}- \textbf{W}_{i2}) (\textbf{W}_{i1}- \textbf{W}_{i2})^{\text {T}} \right\} = 2{\varvec{\Sigma }}^*_u\), we estimate \(\widehat{{\varvec{\Sigma }}}^*_u = (2n)^{-1} \sum _{i=1}^{n} (\textbf{W}_{i1}- \textbf{W}_{i2}) (\textbf{W}_{i1}- \textbf{W}_{i2})^{\text {T}}\). Therefore, the covariance matrix of the measurement errors corresponding to \(\textbf{W}_i = (1/2)(\textbf{W}_{i1} + \textbf{W}_{i2})\) is estimated as \(\hat{\varvec{\Sigma }}_u = \widehat{\varvec{\Sigma }}_u^*/2\), which is a consistent estimator of \(\varvec{\Sigma }_u\) under general regularity conditions as \(n \rightarrow \infty \) (Carroll et al. 1993). Note that since age is assumed to be measured without error, the first row and column of \(\hat{\varvec{\Sigma }}_u\) are zero.
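
A sketch of the replicate-based estimate of the error covariance described above is given below; W1 and W2 are placeholder n x 43 matrices holding the two recall vectors, with age in the first column.

```matlab
% Minimal sketch: estimate Sigma_u from two replicate surrogate measurements.
D = W1 - W2;                                 % differences; E(D_i D_i') = 2 * Sigma_u_star
Sigma_u_star = (D' * D) / (2 * size(D, 1));  % estimate of Sigma_u^*
W = (W1 + W2) / 2;                           % averaged surrogate used in the analysis
Sigma_u_hat = Sigma_u_star / 2;              % error covariance of the averaged surrogate
% The first row and column are zero because age is recorded identically in
% both interviews; they are set explicitly here for numerical cleanliness.
Sigma_u_hat(1, :) = 0;  Sigma_u_hat(:, 1) = 0;
```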

We estimate the central subspace \(\mathcal {S}_{y \vert \textbf{X}}\) using the three sparse estimators considered in the simulation studies, setting \(d=1\) and using \(M=20\) slices. Letting \(\hat{\varvec{\beta }}^{(\text {IL-Tan})}\) and \(\hat{\varvec{\beta }}^{(\text {IL-Lin})}\) be the estimated bases from the IL-Tan and IL-Lin estimators, respectively, the corresponding sufficient predictors are estimated as \(\hat{T}_i^{(\text {IL-Tan})} = \hat{\textbf{X}}_i^{\text {T}}\hat{\varvec{\beta }}^{(\text {IL-Tan})}\) and \(\hat{T}_i^{(\text {IL-Lin})}=\hat{\textbf{X}}_i^{\text {T}}\hat{\varvec{\beta }}^{(\text {IL-Lin})}\), where \(\hat{\textbf{X}}_i = (\hat{\varvec{\Sigma }}_w-\hat{\varvec{\Sigma }}_u)\hat{\varvec{\Sigma }}_w^{-1} \textbf{W}_i\). For the scLAD estimator \(\hat{\varvec{\beta }}^{(\text {scLAD})}\), we denote the corresponding sufficient predictor as \( \hat{T}_i^{(\text {scLAD})} = \hat{\textbf{V}}_i^{\text {T}}\hat{\varvec{\beta }}^{(\text {scLAD})}\), where \(\hat{\textbf{V}}_i = \hat{\varvec{\Delta }}(\hat{\varvec{\Delta }} + \hat{\varvec{\Sigma }}_u)^{-1}\textbf{W}_i\). Figure 1 presents plots of y against the three estimated sufficient predictors.

Fig. 1

Scatterplots of the outcome versus sufficient predictors obtained from the three sparse estimators of the central subspace in the application to the NHANES data. The blue curves are LOESS smoothing curves and the grey areas are confidence bands

It can be seen that the sufficient predictors formed from the scLAD and IL-Lin estimators are relatively similar to each other, and are more informative about the total cholesterol level than that from the IL-Tan estimator. Indeed, the sample correlation between the outcome and \(\hat{\textbf{T}}^{(\text {IL-Tan})}\) is 0.04, while the sample correlations between the outcome and \(\hat{\textbf{T}}^{(\text {IL-Lin})}\) and \(\hat{\textbf{T}}^{(\text {scLAD})}\) are 0.45 and 0.43, respectively. In terms of variable selection, out of 43 predictors, the IL-Lin estimator selects 27 variables and the IL-Tan estimator selects 15 variables, while the proposed scLAD estimator selects 17 variables. The simulation results suggest that IL-Lin is potentially overfitting. Among the variables selected by scLAD, the three predictors whose estimated coefficients have the largest magnitudes are copper, vitamin B12, and zinc.

7 Discussion

In this article, we propose two likelihood-based estimators for the central subspace when the predictors are contaminated by additive measurement errors. These estimators are constructed from a Gaussian inverse regression model for \(\textbf{X}\mid y\) into which measurement error is incorporated in a straightforward manner. The two estimators are based on maximizing likelihood-based objective functions of two different adjusted surrogates, the first using the conditional covariance \(\text {E}\left\{ \text {Var}(\textbf{X}\mid y) \right\} \) and the second using the marginal covariance of \(\textbf{X}\). We establish the asymptotic equivalence of these two estimators. When the number of covariates is large, we propose a sparse corrected LAD estimator that facilitates variable selection by introducing a penalty term that regularizes the projection matrix directly. The corresponding objective function is defined on an appropriate Grassmann manifold, and has closed-form gradients that facilitate efficient computation. Simulation studies generating data from forward index models demonstrate that the likelihood-based estimators have superior performance to the inverse moment-based estimators in terms of both estimation and variable selection accuracy. Note that in our simulation studies, we did not consider generating data directly from the inverse regression model. We anticipate that the overall conclusions would be similar to those presented in this article with forward index models, i.e., where we generated the covariate \(\textbf{X}\) from the marginal distribution and then generated the response y from the conditional distribution \(y \mid \textbf{B}^{\text {T}} \textbf{X}\), if not more favorable to our proposed estimators.

Future research can address the theoretical properties of the sparse likelihood-based estimators, particularly their variable selection consistency (see, for instance, Lin et al. 2019; Nghiem et al. 2022, for related works), and develop likelihood-based estimators of the central subspace that are robust to non-Gaussian measurement errors. Furthermore, we can consider extending the proposed method to the more general linear measurement error model, \(\textbf{W}= \textbf{A}\textbf{X}+ \textbf{U}\), with \(\textbf{A}\ne \textbf{I}_p\). We conjecture that our proposed estimators, along with their corresponding theoretical results, will continue to hold for this more general model, provided \(\textbf{A}\) is either known or can be estimated consistently. Finally, the issue of estimating the structural dimension d from the surrogate data remains open.

8 Supplementary materials

The online Supplementary Materials contain additional simulation results that are referred to in Sect. 5. The MATLAB code implementing the corrected estimators is available at https://github.com/lnghiemum/LAD-ME.