Abstract
Functional data such as curves and surfaces have become more and more common with modern technological advancements. The use of functional predictors remains challenging due to their inherent infinite dimensionality. The common practice is to project functional data into a finite-dimensional space. The popular partial least squares method has been well studied for the functional linear model (Delaigle and Hall in Ann Stat 40(1):322–352, 2012). As an alternative, quantile regression provides a robust and more comprehensive picture of the conditional distribution of a response when it is non-normal, heavy-tailed, or contaminated by outliers. While partial quantile regression (PQR) was proposed in (Yu et al. in Neurocomputing 195:74–87, 2016), no theoretical guarantees were provided due to the iterative nature of the algorithm and the non-smoothness of the quantile loss function. To address these issues, we propose an alternative PQR formulation with guaranteed convergence. This novel formulation motivates new theories and allows us to establish asymptotic properties. Numerical studies on a benchmark dataset show the superiority of our new approach. We also apply our novel method to a functional magnetic resonance imaging dataset to predict attention deficit hyperactivity disorder and a diffusion tensor imaging dataset to predict Alzheimer’s disease.
References
Delaigle A, Hall P (2012) Methodology and theory for partial least squares applied to functional data. Ann Stat 40(1):322–352
Koenker R, Bassett G (1978) Regression quantiles. Econometrica 46(1):33–50
Yu D, Kong L, Mizera I (2016) Partial functional linear quantile regression for neuroimaging data analysis. Neurocomputing 195:74–87
Kato K (2012) Estimation in functional linear quantile regression. Ann Stat 40(6):3108–3136
Koenker R (2005) Quantile Regression. Cambridge University Press, Cambridge
James GM, Wang J, Zhu J (2009) Functional linear regression that’s interpretable. Ann Stat 37(5):2083–2108
Cardot H, Crambes C, Sarda P (2005) Quantile regression when the covariates are functions. Nonparametr Stat 17(7):841–856
Sun Y (2005) Semiparametric efficient estimation of partially linear quantile regression models. Ann Econ Financ 6(1):105
Cardot H, Ferraty F, Sarda P (2003) Spline estimators for the functional linear model. Stat Sin 13(3):571–592
Zhao Y, Ogden RT, Reiss PT (2012) Wavelet-based lasso in functional linear regression. J Comput Gr Stat 21(3):600–617
Lu Y, Du J, Sun Z (2014) Functional partially linear quantile regression model. Metrika 77(2):317–332
Tang Q, Cheng L (2014) Partial functional linear quantile regression. Sci China Math 57(12):2589–2608
Hall P, Horowitz JL (2007) Methodology and convergence rates for functional linear regression. Ann Stat 35(1):70–91
Lee ER, Park BU (2012) Sparse estimation in functional linear regression. J Multivar Anal 105(1):1–17
Wold H (1975) Soft modeling by latent variables: the nonlinear iterative partial least squares approach. Perspectives in probability and statistics, papers in honour of M.S. Bartlett, pp 520–540
Helland IS (1990) Partial least squares regression and statistical models. Scand J Stat 17(2):97–114
Frank LE, Friedman JH (1993) A statistical view of some chemometrics regression tools. Technometrics 35(2):109–135
Nguyen DV, Rocke DM (2004) On partial least squares dimension reduction for microarray-based classification: a simulation study. Comput Stat Data Anal 46(3):407–425
Abdi H (2010) Partial least squares regression and projection on latent structure regression (PLS regression). Wiley Interdiscip Rev Comput Stat 2(1):97–106
Preda C, Saporta G (2005) PLS regression on a stochastic process. Comput Stat Data Anal 48(1):149–158
Zhao Y, Chen H, Ogden RT (2015) Wavelet-based weighted lasso and screening approaches in functional linear regression. J Comput Gr Stat 24(3):655–675
Wu Y, Liu Y (2009) Variable selection in quantile regression. Stat Sin 19(2):801
Zheng S (2011) Gradient descent algorithms for quantile regression with smooth approximation. Int J Mach Learn Cybern 2(3):191–207
Muggeo VM, Sciandra M, Augugliaro L (2012) Quantile regression via iterative least squares computations. J Stat Comput Simul 82(11):1557–1569
Chen C (2007) A finite smoothing algorithm for quantile regression. J Comput Gr Stat 16(1):136–164
Crambes C, Kneip A, Sarda P (2009) Smoothing splines estimators for functional linear regression. Ann Stat 37(1):35–72
Tu W, Liu P, Zhao J, Liu Y, Kong L, Li G, Jiang B, Tian G, Yao H (2019) M-estimation in low-rank matrix factorization: a general framework. In: 2019 IEEE international conference on data mining (ICDM), pp 568–577
Zhu R, Niu D, Kong L, Li Z (2017) Expectile matrix factorization for skewed data analysis. In: thirty-first AAAI conference on artificial intelligence
Van der Vaart AW (2000) Asymptotic Statistics, vol 3. Cambridge University Press, Cambridge
Hjort NL, Pollard D (2011) Asymptotics for minimisers of convex processes. arXiv preprint arXiv:1107.3806
Nesterov Y (2005) Smooth minimization of non-smooth functions. Math Progr 103(1):127–152
Li X, Xu D, Zhou H, Li L (2018) Tucker tensor regression and neuroimaging analysis. Stat Biosci 10:520–545
Zhou H, Li L, Zhu H (2013) Tensor regression with applications in neuroimaging data analysis. J Am Stat Assoc 108(502):540–552
Yu K, Moyeed RA (2001) Bayesian quantile regression. Stat Probab Lett 54(4):437–447
Sánchez B, Lachos H, Labra V (2013) Likelihood based inference for quantile regression using the asymmetric Laplace distribution. J Stat Comput Simul 81:1565–1578
Rothenberg TJ (1971) Identification in parametric models. Econometrica 39(3):577–591
De Leeuw J (1994) Block-relaxation algorithms in statistics. In: Bock H, Lenski W, Richter MM (eds) Information systems and data analysis. Studies in classification, data analysis, and knowledge organization. Springer, New York, pp 308–324
Yu D, Zhang L, Mizera I, Jiang B, Kong L (2019) Sparse wavelet estimation in quantile regression with multiple functional predictors. Comput Stat Data Anal 136:12–29
Wang Y, Kong L, Jiang B, Zhou X, Yu S, Zhang L, Heo G (2019) Wavelet-based lasso in functional linear quantile regression. J Stat Comput Simul 89(6):1111–1130
Golubitsky M, Guillemin V (2012) Stable mappings and their singularities, vol 14. Springer, New York
van der Vaart AW, Wellner JA (2000) Weak Convergence. Springer, New York
Acknowledgements
Linglong Kong, Ivan Mizera, and Bei Jiang acknowledge funding support for this research from the Natural Sciences and Engineering Research Council of Canada (NSERC). Dengdeng Yu acknowledges funding support for this research through a start-up grant from the University of Texas at Arlington. Linglong Kong and Dengdeng Yu further acknowledge funding support from the Canadian Statistical Sciences Institute (CANSSI).
Appendix 1 Proofs and Verification
This appendix contains detailed proofs of results omitted from the main paper.
1.1 Verification of Assumptions of Theorem 1
Let us first check assumption (i). For a given fixed N, \(\rho _{\tau \nu _N}\) is differentiable, so \(l_N(\varvec{\upsilon })\) is continuous. As \(||\varvec{\upsilon }||\rightarrow \infty\), that is, as \(||(\alpha ,\varvec{\beta })||\rightarrow \infty\) or \(||{\textbf{C}}||\rightarrow \infty\), the function \(l_N\) approaches \(-\infty\); therefore, \(l_N\) is coercive. To verify assumption (ii), observe that \(y_i - \alpha - {\textbf{x}}_i^T \varvec{\beta }- {\textbf{z}}_{i}^T {\textbf{C}}_{} {\textbf{1}}_K\) is affine in \({\textbf{C}}\). Since \(- \rho _\tau\) and its approximation \(-\rho _{\tau \nu _N}\) are strictly concave, both \(l_N(\varvec{\upsilon })\) and \(l(\varvec{\upsilon })\) are strictly concave in \({\textbf{C}}\). Since they are also strictly concave in \(\alpha\), strict concavity of \(l_N(\varvec{\upsilon })\) and \(l(\varvec{\upsilon })\) in \(\varvec{\upsilon }\) follows. Lastly, assumption (iii) ensures that a locally optimal point is isolated. One important property of stationary points is that, if the Hessian matrix at a stationary point is nonsingular, then the stationary point is isolated [40]. Alternatively, we can impose Condition A1 of [5], which requires the distribution functions \(F_i\) to be absolutely continuous with continuous densities \(f_i\) uniformly bounded away from 0 and \(\infty\) at the points \(\xi _i(\tau ) = F_i^{-1}(\tau )\), where \(F_i\) is the conditional distribution of \(y_i\) given \({\textbf{z}}_{i}\). Lemma 2 of [30] states that, if a sequence of convex functions \(l_N(\varvec{\upsilon })\) defined on an open convex set S in \({\mathbb {R}}^{Kd+p+1}\) converges pointwise to l, then \(\sup _{\varvec{\upsilon }\in {\textbf{K}}}|l_N(\varvec{\upsilon })-l(\varvec{\upsilon })|\) tends to zero for each compact subset \({\textbf{K}}\) of S.
As long as the maximum is a unique interior point of S, the maximizer of \(l_N\) converges to the maximizer of \(l\).
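The smoothness, approximation, and convexity claims can be checked numerically for a concrete smoothed check loss. The sketch below uses a Nesterov-type smoothing in the spirit of [35] (our own illustrative choice, not necessarily the exact \(\rho _{\tau \nu _N}\) of the paper): it is convex, differentiable, and uniformly within \(\nu \max (\tau ,1-\tau )^2/2\) of \(\rho _\tau\).

```python
import numpy as np

def rho_tau(u, tau):
    """Quantile check loss: rho_tau(u) = u * (tau - 1{u < 0})."""
    u = np.asarray(u, dtype=float)
    return u * (tau - (u < 0))

def rho_tau_nu(u, tau, nu):
    """Nesterov-type smoothing of the check loss:
    max over a in [tau - 1, tau] of a*u - (nu/2)*a^2,
    which is convex, differentiable, and uniformly within
    nu * max(tau, 1 - tau)^2 / 2 of rho_tau."""
    u = np.asarray(u, dtype=float)
    a = np.clip(u / nu, tau - 1.0, tau)   # maximizing dual variable
    return a * u - 0.5 * nu * a**2

tau = 0.3
u = np.linspace(-5.0, 5.0, 2001)
for nu in (1.0, 0.1, 0.01):
    gap = np.max(np.abs(rho_tau_nu(u, tau, nu) - rho_tau(u, tau)))
    bound = nu * max(tau, 1.0 - tau)**2 / 2.0
    print(f"nu={nu}: sup gap {gap:.5f} <= bound {bound:.5f}")

# Convexity sanity check: sampled second differences stay nonnegative.
assert np.all(np.diff(rho_tau_nu(u, tau, 0.1), 2) >= -1e-12)
```

As \(\nu \rightarrow 0\) the sup gap shrinks linearly, matching the pointwise convergence \(\rho _{\tau \nu }\rightarrow \rho _\tau\) used throughout the appendix.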
1.2 Proof of Theorem 3
This result can be verified by using Theorem 1 of [36], reproduced below in Lemma 6. The regularity assumptions of this lemma are satisfied by the current model since (1) the parameter space S is open, (2) the density \(p(y,{\textbf{z}}\mid {\textbf{C}})\) is proper for all \({\textbf{C}} \in S\), (3) the support of the density \(p(y,{\textbf{z}}\vert {\textbf{C}})\) is the same for all \({\textbf{C}} \in S\), (4) the log density \(l_N({\textbf{C}}\vert y,{\textbf{z}}) = \ln p(y,{\textbf{z}}\vert {\textbf{C}})\) is continuously differentiable, and (5) the information matrix \(I_N({\textbf{C}})\) is continuous in \({\textbf{C}}\) by Theorem 2. Then by Lemma 6, \({\textbf{C}}\) is locally identifiable if and only if \(I_N({\textbf{C}})\) is nonsingular.
Lemma 6
([36], Theorem 1) Let \(\theta _0\) be a regular point of the information matrix \(I(\theta )\). Then \(\theta _0\) is locally identifiable if and only if \(I(\theta _0)\) is nonsingular.
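The equivalence between local identifiability and a nonsingular information matrix can be illustrated numerically. The toy sketch below (our own illustration, not the paper's tensor model) builds a working quantile log-likelihood from a Nesterov-type smoothed check loss, estimates \(I_N\) by averaging outer products of scores, and shows that a full-rank design yields a nonsingular \(I_N\) while a duplicated covariate makes it singular.

```python
import numpy as np

rng = np.random.default_rng(0)
tau, nu = 0.5, 0.1

def scores(beta, X, y):
    """Per-observation score of the working log-likelihood -rho_{tau,nu}(y - X beta),
    where rho_{tau,nu} is a Nesterov-type smoothed check loss whose derivative
    is psi(u) = clip(u / nu, tau - 1, tau)."""
    psi = np.clip((y - X @ beta) / nu, tau - 1.0, tau)
    return psi[:, None] * X

n = 5000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta0 = np.array([1.0, 2.0])
y = X @ beta0 + rng.normal(size=n)

# Full-rank design: the empirical information matrix is nonsingular.
S = scores(beta0, X, y)
I_N = S.T @ S / n
eig = np.linalg.eigvalsh(I_N)
print("min eigenvalue (full rank):", eig[0])

# Duplicated covariate: the information matrix becomes singular,
# matching the loss of local identifiability.
X_bad = np.column_stack([X, X[:, 1]])
S_bad = scores(np.array([1.0, 2.0, 0.0]), X_bad, y)
I_bad = S_bad.T @ S_bad / n
print("min eigenvalue (collinear):", np.linalg.eigvalsh(I_bad)[0])
```

The singular case corresponds to a parameter direction along which the score is identically zero, so nearby parameter values generate the same density.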
1.3 Proof of Theorem 4
Proof
For simplicity, we omit \(\alpha\) and \(\varvec{\beta }\); the conclusions generalize easily to the case that includes them. We want to show the consistency of the estimated factor matrix \(\hat{{\textbf{C}}}_n\). The following well-known theorem is our main tool for establishing consistency.
Lemma 7
([29], Theorem 5.7) Let \(M_n\) be random functions and let M be a fixed function of \(\theta\) such that, for every \(\epsilon > 0\),
\[ \sup _{\theta } |M_n(\theta ) - M(\theta )| \rightarrow 0 \text { in probability}, \qquad \sup _{\theta : d(\theta ,\theta _0) \ge \epsilon } M(\theta ) < M(\theta _0). \]
Then any sequence of estimators \({\hat{\theta }}_n\) with \(M_n({\hat{\theta }}_n) \ge M_n(\theta _0) - o_P(1)\) converges in probability to \(\theta _0\).
To apply Lemma 7 in our setting, we take the nonrandom function M to be \({\textbf{C}} \mapsto {\mathbb {P}}_{{\textbf{C}}_0} \left[ l_N (Y,Z\mid {\textbf{C}})\right]\) and the sequence of random functions to be \(M_n: {\textbf{C}} \mapsto \frac{1}{n} \sum _{i=1}^n l_N (y_i,z_i\mid {\textbf{C}}) = {\mathbb {P}}_n l_N(\cdot \mid {\textbf{C}})\), where \({\mathbb {P}}_n\) denotes the empirical measure under \({\textbf{C}}_0\). Then \(M_n\) converges to M a.s. by the strong law of large numbers. The second condition of Lemma 7 requires that \({\textbf{C}}_0\) be a well-separated maximum of M; this is guaranteed by the (global) identifiability of \({\textbf{C}}_0\) and the information inequality. The first condition, uniform convergence, is verified by Glivenko–Cantelli theory [29].
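The argmax-consistency mechanism of Lemma 7 can be seen in a toy one-parameter version: the maximizer of the empirical smoothed objective \(M_n\) approaches the maximizer of M, here the \(\tau\)-quantile of a normal distribution, as n grows. The smoothing below is a Nesterov-type construction of our own choosing, used only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
tau, nu = 0.75, 0.05
theta0 = 0.6744897501960817   # true 0.75-quantile of N(0, 1)

def rho_tau_nu(u):
    """Nesterov-type smoothed check loss (illustrative choice)."""
    a = np.clip(u / nu, tau - 1.0, tau)
    return a * u - 0.5 * nu * a**2

def argmax_Mn(y, grid):
    """Grid-search maximizer of the empirical objective M_n(theta)."""
    vals = np.array([-np.mean(rho_tau_nu(y - t)) for t in grid])
    return grid[np.argmax(vals)]

grid = np.linspace(-3.0, 3.0, 1201)   # step 0.005
errs = []
for n in (100, 1000, 10000):
    est = argmax_Mn(rng.normal(size=n), grid)
    errs.append(abs(est - theta0))
    print(f"n={n:6d}: estimate {est:+.3f}, |error| {errs[-1]:.3f}")
```

The residual error at large n combines sampling variability, the grid resolution, and a small smoothing bias of order \(\nu\).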
The density is \(p_{{\textbf{C}}} (y \mid {\textbf{z}}) = const \cdot \exp \left[ - \rho _{\tau \nu _N} \left( y-\eta ({\textbf{C}},{\textbf{z}})\right) \right]\), where \(\eta ({\textbf{C}},{\textbf{z}}) = \langle {\textbf{C}}, {\textbf{z}}\rangle\). Take \(m_{{\textbf{C}}}=\ln \left[ (p_{{\textbf{C}}}+p_{{\textbf{C}}_0})/2\right]\). First we show that \({\textbf{C}}_0\) is a well-separated maximum of the function \(M({\textbf{C}}):={\mathbb {P}}_{{\textbf{C}}_0} m_{{\textbf{C}}}\). The global identifiability of \({\textbf{C}}_0\) and the information inequality guarantee that \({\textbf{C}}_0\) is the unique maximum of M. To show that it is well separated, we verify that \(M({\textbf{C}}_k) \rightarrow M({\textbf{C}}_0)\) implies \({\textbf{C}}_k \rightarrow {\textbf{C}}_0\).
Suppose \(M({\textbf{C}}_k) \rightarrow M({\textbf{C}}_0)\); then \(\langle {\textbf{C}}_k, {\textbf{Z}} \rangle \rightarrow \langle {\textbf{C}}_0, {\textbf{Z}} \rangle\) in probability. If the \({\textbf{C}}_k\) are bounded, then \({\textbf{E}} \left[ \langle {\textbf{C}}_k-{\textbf{C}}_0, {\textbf{Z}} \rangle ^2\right] \rightarrow 0\) and \({\textbf{C}}_k \rightarrow {\textbf{C}}_0\) by the nonsingularity of \({\textbf{E}}\left[ (\textrm{vec}{\textbf{Z}})(\textrm{vec}{\textbf{Z}})^T\right]\). On the other hand, the \({\textbf{C}}_k\) cannot run off to infinity: if they did, then \(\langle {\textbf{C}}_k, {\textbf{Z}} \rangle / \left| {\textbf{C}}_k \right| \rightarrow 0\) in probability, which in turn implies \({\textbf{C}}_k / \left| {\textbf{C}}_k\right| \rightarrow {\textbf{0}}\), a contradiction since each \({\textbf{C}}_k / \left| {\textbf{C}}_k\right|\) has unit norm.
For the uniform convergence, note that the class of functions \(\{ \langle {\textbf{C}},{\textbf{Z}}\rangle , {\textbf{C}} \in S \}\) forms a VC class: it lies in a finite-dimensional vector space of polynomials of degree 1, so the vector-space argument applies [41, 2.6.15]. This implies that \(\{ \eta (\langle {\textbf{C}},{\textbf{Z}}\rangle ), {\textbf{C}} \in S \}\) is also a VC class, since \(\eta\) is a monotone function [41, 2.6.18].
Now \(m_{{\textbf{C}}}\) is Lipschitz in \(\eta\), since its derivative in \(\eta\) is bounded by \(\sup _u |\rho ^\prime _{\tau \nu _N}(u)| \le \max (\tau , 1-\tau )\). Indeed, \(\rho _{\tau \nu _N}(u) \rightarrow \rho _\tau (u)\) as \(N \rightarrow \infty\), which also implies \(\rho ^\prime _{\tau \nu _N}(u) \rightarrow \rho ^\prime _{\tau }(u)\) as \(N \rightarrow \infty\) for every \(u \ne 0\), while \(\rho ^\prime _{\tau \nu _N}(0)=0\). Similarly, \(m_{{\textbf{C}}}\) is Lipschitz in \(\eta _0\). A Lipschitz composition of a Donsker class is still a Donsker class [29, 19.20]. Therefore, \(\left\{ {\textbf{C}} \mapsto m_{\textbf{C}} \right\}\) is a bounded Donsker class with the trivial envelope function 1. A Donsker class is certainly a Glivenko–Cantelli class. Finally, the Glivenko–Cantelli theorem establishes the uniform convergence condition required by Lemma 7.
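The pointwise convergence \(\rho ^\prime _{\tau \nu }(u) \rightarrow \rho ^\prime _\tau (u)\) for \(u \ne 0\), together with \(\rho ^\prime _{\tau \nu }(0)=0\), is easy to verify numerically for a Nesterov-type smoothing of the check loss (an illustrative choice, not necessarily the paper's \(\rho _{\tau \nu _N}\)):

```python
import numpy as np

tau = 0.3

def drho(u, nu):
    """Derivative of a Nesterov-type smoothed check loss (illustrative choice):
    clip(u / nu, tau - 1, tau)."""
    return np.clip(u / nu, tau - 1.0, tau)

# As nu -> 0 the derivative converges to tau (u > 0) or tau - 1 (u < 0),
# while at u = 0 it equals 0 for every nu.
for u0 in (0.5, -0.5, 0.0):
    print(u0, [float(drho(u0, nu)) for nu in (1.0, 0.1, 1e-4)])
```

The derivative is also uniformly bounded by \(\max (\tau , 1-\tau )\), which is what drives the Lipschitz property above.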
When the parameter is restricted to a compact set, \(\eta (\langle {\textbf{C}}, {\textbf{Z}}\rangle )\) is confined to a bounded interval, and \(l_N\) is Lipschitz on that interval. It follows that \(\{ l_N({\textbf{C}}) = l_N \circ \eta \circ \langle {\textbf{C}}, {\textbf{Z}}\rangle , {\textbf{C}} \in S\}\) is a Donsker class, as composition with a monotone or Lipschitz function preserves the Donsker property. Therefore, the Glivenko–Cantelli theorem establishes the uniform convergence. Compactness of the parameter space implies that \({\textbf{C}}_0\) is a well-separated maximum whenever it is the unique maximizer of \(M({\textbf{C}}) = {\mathbb {P}}_{{\textbf{C}}_0} m_{{\textbf{C}}}\) [29, Exercise 5.27]. Uniqueness is guaranteed by the information inequality whenever \({\textbf{C}}_0\) is identifiable. This verifies the consistency for quantile regression. \(\square\)
Lemma 8
Tensor quantile linear regression model (3) is quadratic mean differentiable (q.m.d.).
Proof
By a well-known result [29, Lemma 7.6], it suffices to verify that the density is continuously differentiable in the parameter for \(\mu\)-almost all x and that the Fisher information matrix exists and is continuous. The derivative of the density is
which is well defined and continuous by Proposition 2. The same proposition shows that the information matrix exists and is continuous. Therefore, the tensor quantile linear regression model is q.m.d. \(\square\)
1.4 Proof of Theorem 5
Proof
The following result gives asymptotic normality for models satisfying q.m.d.
Lemma 9
Suppose the model is differentiable in quadratic mean at an inner point \(\theta _0\) of \(\Theta \subset {\textbf{R}}^k\). Furthermore, suppose that there exists a measurable function \({\dot{l}}\) with \({\textbf{P}}_{\theta _0} {\dot{l}}^2 < \infty\) such that, for every \(\theta _1\) and \(\theta _2\) in a neighborhood of \(\theta _0\),
\[ \left| \ln p_{\theta _1}(x) - \ln p_{\theta _2}(x) \right| \le {\dot{l}}(x) \, \Vert \theta _1 - \theta _2 \Vert . \]
If the Fisher information matrix \(I_{\theta _0}\) is nonsingular and \({\hat{\theta }}_n\) is consistent, then
\[ \sqrt{n}\, ({\hat{\theta }}_n - \theta _0) = I_{\theta _0}^{-1} \frac{1}{\sqrt{n}} \sum _{i=1}^n {\dot{l}}_{\theta _0}(X_i) + o_P(1). \]
In particular, the sequence \(\sqrt{n} ({\hat{\theta }}_n -\theta _0)\) is asymptotically normal with mean zero and covariance matrix \(I^{-1}_{\theta _0}\).
Lemma 8 shows that the tensor quantile linear regression model is q.m.d. By Theorem 2 and the chain rule, the score function
is uniformly bounded in y and \({\textbf{x}}\), and continuous in \({\textbf{C}}\) for every y and \({\textbf{x}}\), with \({\textbf{C}}\) ranging over a compact set \(S_0\). For a sufficiently small neighborhood U of \(S_0\), \(\sup _U \left| {\dot{l}}_{N}({\textbf{C}}) \right|\) is square-integrable. Thus, the local Lipschitz condition is satisfied and Lemma 9 applies. \(\square\)
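The \(\sqrt{n}\)-normality asserted by Lemma 9 can be illustrated with a small Monte Carlo in the simplest special case, the sample \(\tau\)-quantile, whose asymptotic variance \(\tau (1-\tau )/f^2(\xi _\tau )\) plays the role of \(I^{-1}_{\theta _0}\). This is an illustration of the limit theorem only, not a simulation of the tensor model.

```python
import numpy as np

rng = np.random.default_rng(2)
tau, n, reps = 0.5, 2000, 500
xi = 0.0                                  # true median of N(0, 1)
f_xi = 1.0 / np.sqrt(2.0 * np.pi)         # standard normal density at the median
asy_sd = np.sqrt(tau * (1.0 - tau)) / f_xi

# Monte Carlo: sqrt(n)-scaled estimation error of the sample quantile
# should match the asymptotic standard deviation above.
ests = np.array([np.quantile(rng.normal(size=n), tau) for _ in range(reps)])
z = np.sqrt(n) * (ests - xi)
print(f"Monte Carlo sd {z.std():.3f} vs asymptotic sd {asy_sd:.3f}")
```

The Monte Carlo standard deviation of \(\sqrt{n}({\hat{\xi }}_n - \xi )\) lands close to \(\sqrt{\tau (1-\tau )}/f(\xi _\tau ) \approx 1.253\) for the standard normal median.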
Yu, D., Pietrosanu, M., Mizera, I. et al. Functional Linear Partial Quantile Regression with Guaranteed Convergence for Neuroimaging Data Analysis. Stat Biosci (2024). https://doi.org/10.1007/s12561-023-09412-7