Abstract
We demonstrate that, in a regression setting with a Hilbertian predictor, a response variable is more likely to be more highly correlated with the leading principal components of the predictor than with trailing ones. This is despite the extraction procedure being unsupervised. Our results are established under the conditional independence model, which includes linear regression and single-index models as special cases, with some assumptions on the regression vector. These results are a generalisation of earlier work which showed that this phenomenon holds for predictors which are real random vectors. A simulation study is used to quantify the phenomenon.
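To make the claim concrete, here is a minimal simulation sketch (hypothetical parameters, not the simulation study reported in the paper): a finite-dimensional predictor with a decaying eigenvalue spectrum, a random regression vector, and a comparison of the response's absolute correlation with the first and last principal components.

```python
import numpy as np

# Minimal sketch: predictor with decaying eigenvalue spectrum, random
# regression vector, linear model with additive noise. The coordinates of X
# are already its principal components here, since the covariance is diagonal.
rng = np.random.default_rng(0)
p, n, reps = 10, 500, 200
eigvals = 1.0 / np.arange(1, p + 1) ** 2

lead_wins = 0
for _ in range(reps):
    X = rng.standard_normal((n, p)) * np.sqrt(eigvals)
    beta = rng.standard_normal(p)          # random regression vector
    y = X @ beta + rng.standard_normal(n)  # response with noise
    c_first = abs(np.corrcoef(X[:, 0], y)[0, 1])  # leading PC
    c_last = abs(np.corrcoef(X[:, -1], y)[0, 1])  # trailing PC
    lead_wins += c_first > c_last

print(lead_wins / reps)  # fraction of runs where the leading PC wins
```

In runs of this sketch the leading component wins in the large majority of replications, matching the qualitative claim; the specific dimensions and spectrum above are illustrative choices only.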
References
Arnold, B. C., Brockett, P. L. (1992). On distributions whose component ratios are Cauchy. American Statistician, 46(1), 25–26.
Artemiou, A., Li, B. (2009). On principal components regression: A statistical explanation of a natural phenomenon. Statistica Sinica, 19, 1557–1565.
Artemiou, A., Li, B. (2013). Predictive power of principal components for single-index model and sufficient dimension reduction. Journal of Multivariate Analysis, 119, 176–184.
Cook, R. (2007). Fisher lecture: Dimension reduction in regression. Statistical Science, 22(1), 1–26.
Cox, D. R. (1968). Notes on some aspects of regression analysis. Journal of the Royal Statistical Society Series A (General), 131(3), 265–279.
Dauxois, J., Ferré, L., Yao, A.-F. (2001). Un modèle semi-paramétrique pour variables aléatoires hilbertiennes. Comptes Rendus de l’Académie des Sciences, 333(1), 947–952.
Ferré, L., Yao, A. F. (2003). Functional sliced inverse regression analysis. Statistics, 37(6), 475–488.
Hall, P., Yang, Y. J. (2010). Ordering and selecting components in multivariate or functional data linear prediction. Journal of the Royal Statistical Society Series B: Statistical Methodology, 72(1), 93–110.
Hsing, T., Eubank, R. (2015). Theoretical foundations of functional data analysis, with an introduction to linear operators. 1st ed. West Sussex: Wiley.
Kingman, J. F. C. (1972). On random sequences with spherical symmetry. Biometrika, 59(2), 492.
Li, B. (2007). Comment: Fisher lecture—Dimension reduction in regression. Statistical Science, 22(1), 32–35.
Li, B. (2018). Sufficient dimension reduction: Methods and applications with R. 1st ed. Boca Raton: CRC Press.
Li, B., Song, J. (2017). Nonlinear sufficient dimension reduction for functional data. The Annals of Statistics, 45(3), 1059–1095.
Li, Y. (2007). A note on Hilbertian elliptically contoured distributions. Unpublished manuscript, Department of Statistics, University of Georgia.
Ni, L. (2011). Principal component regression revisited. Statistica Sinica, 21, 741–747.
Pinelis, I., Molzon, R. (2016). Optimal-order bounds on the rate of convergence to normality in the multivariate delta method. Electronic Journal of Statistics, 10(1), 1001–1063.
Ramsay, J., Silverman, B. W. (1997). Functional data analysis. 1st ed. New York: Springer.
Acknowledgements
We would like to thank the Editor, Associate Editor and two reviewers for their constructive comments and suggestions which helped improve an earlier version of the manuscript.
Appendices
Appendix A: Essential definitions
For the benefit of the reader, we present here some fundamental definitions in functional data analysis. These definitions can be found in Hsing and Eubank (2015) along with a deeper exposition of the field. We first define random variables and nuclear operators in and on a Hilbert space, respectively. Although our interest lies in the case where the random variables are random functions, the definitions are given for the more general setting of Hilbertian random variables. We note that this more abstract framework includes function spaces where the functions need not be univariate so this paper applies to, say, predictors which are random fields. This work therefore is relevant to a number of fields including FMRI data analysis, spatial statistics, image processing and speech recognition.
Definition 2
Let \({{\left( \varOmega , \mathfrak {F}, {{\mathbb {P}}}\right) }}\) be a probability space and \({{\left( {{{\mathcal {H}}}},\ {{\mathcal {B}}}{\left( {{{\mathcal {H}}}}\right) }\right) }}\) be a measurable space, where \({{\mathcal {H}}}\) is a Hilbert space and \({{\mathcal {B}}}{\left( {{\mathcal {H}}}\right) }\) is its associated Borel \(\sigma \)-field. A measurable function \({{{X}}: {{{\left( \varOmega , \mathfrak {F}, {{\mathbb {P}}}\right) }}} \rightarrow {{{\left( {{{\mathcal {H}}}},\ {{\mathcal {B}}}{\left( {{{\mathcal {H}}}}\right) }\right) }}}}\) is called an \({{\mathcal {H}}}\)-valued random variable. We also say that \({X}\) is a Hilbertian random variable.
Definition 3
Let \({{\mathcal {H}}}\) be a Hilbert space. A compact operator \({{L}: {{{\mathcal {H}}}} \rightarrow {{{\mathcal {H}}}}}\), that is, one which is the operator-norm limit of a sequence of finite-rank operators, is said to be a nuclear operator if the sum of its eigenvalues is finite.
Remark 4
The class of nuclear operators on a Hilbert space contains the class of all operators which have finitely many nonzero eigenvalues.
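A standard example (not taken from the paper) of a nuclear operator with infinitely many nonzero eigenvalues is the covariance operator of standard Brownian motion on \(L^{2}\left[ 0,1\right] \):

```latex
\[
(L f)(t) = \int_0^1 \min(s, t)\, f(s)\, \mathrm{d}s,
\qquad
\lambda_k = \frac{1}{(k - \tfrac{1}{2})^2 \pi^2},
\qquad
\sum_{k=1}^{\infty} \lambda_k = \int_0^1 t \,\mathrm{d}t = \tfrac{1}{2} < \infty ,
\]
```

so \(L\) is nuclear even though no finite-rank truncation of it is exact.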
The expectation of a Hilbertian random variable is defined in terms of the Bochner integral; the construction is given in Hsing and Eubank (2015) and is similar to that of the Lebesgue integral, so we do not present it here. For our purposes, it is enough to note that for a Hilbertian random variable a, the expectation \({{{\mathbb {E}}}{\left( {a}\right) }}\) is unique, an element of the space \({{\mathcal {H}}}\), and satisfies
\[ {{\mathbb {E}}}{\left( {\left\langle a, x \right\rangle _{{{\mathcal {H}}}}}\right) } = {\left\langle {{\mathbb {E}}}{\left( a\right) }, x \right\rangle _{{{\mathcal {H}}}}} \quad \text {for all } x \in {{\mathcal {H}}}. \]
Remark 5
Observe that the expectation on the left-hand side is the expectation of a real random variable, whereas the expectation on the right side is the expectation of an \({{\mathcal {H}}}\)-valued random variable.
We will also require a generalisation of the notion of variance for a Hilbertian random variable, but first we define a tensor product operation.
Definition 4
Let \(x_{1}, x_{2}\) be elements of Hilbert spaces \({{\mathcal {H}}}_{1}\) and \({{\mathcal {H}}}_{2}\), respectively. The tensor product operator \({{{\left( x_{1} \otimes _{1} x_{2}\right) }}: {{{\mathcal {H}}}_{1}} \rightarrow {{{\mathcal {H}}}_{2}}}\) is defined by
\[ {\left( x_{1} \otimes _{1} x_{2}\right) }{\left( y\right) } = {\left\langle x_{1}, y \right\rangle _{{{\mathcal {H}}}_{1}}}\, x_{2} \]
for \(y \in {{\mathcal {H}}}_{1}\). If \({{\mathcal {H}}}_{1} = {{\mathcal {H}}}_{2}\), we use \(\otimes \) instead of \(\otimes _{1}\).
In the case where \({{\mathcal {H}}}_{1} = {{\mathcal {H}}}_{2} = {{\mathbb {R}}}^{p}\), we have that \({{x_{1}} \otimes {x_{2}}} = x_{2} {{x_{1}}^{{\mathrm{T}}}}\), so the usual covariance matrix can be written as
\[ {{\mathrm {Var}}{\left( X\right) }} = {{\mathbb {E}}}{\left[ {\left( X - {{\mathbb {E}}}{\left( X\right) }\right) } \otimes {\left( X - {{\mathbb {E}}}{\left( X\right) }\right) }\right] }. \]
This notation will also be used for a covariance operator, but with \({X}\) being an \({{\mathcal {H}}}\)-valued random variable. We note that all covariance operators on a Hilbert space \({{\mathcal {H}}}\) are compact, nonnegative definite, and self-adjoint. Proofs of these facts can be found in Pinelis and Molzon (2016).
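The finite-dimensional identification \({{x_{1}} \otimes {x_{2}}} = x_{2}{{x_{1}}^{{\mathrm{T}}}}\) can be checked directly; a minimal numpy sketch (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)
x1, x2, y = rng.standard_normal((3, 4))  # three arbitrary vectors in R^4

T = np.outer(x2, x1)        # the matrix x2 x1^T, identified with x1 (x) x2
lhs = T @ y                 # (x1 (x) x2)(y)
rhs = np.dot(x1, y) * x2    # <x1, y> x2
print(np.allclose(lhs, rhs))  # prints True
```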
Assuming that the covariance operator of some predictor \({X}\) is nuclear gives meaning to the phrase “PCA captures most of the variability in the data” for the infinite-dimensional setting. This is because it supplies a notion of how much variance there is in total.
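Concretely, writing \(\lambda _{1} \ge \lambda _{2} \ge \cdots \) for the eigenvalues of the covariance operator, nuclearity is exactly what makes the usual "proportion of variance explained" by the first m principal components well defined:

```latex
\[
\frac{\sum_{k=1}^{m} \lambda_k}{\sum_{k=1}^{\infty} \lambda_k} \in [0, 1],
\qquad
\sum_{k=1}^{\infty} \lambda_k < \infty .
\]
```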
The notion of a spherical distribution was central to the work of Artemiou and Li (2013). For data in an infinite-dimensional space, this notion cannot be generalised, as explained below, but the idea of an elliptical distribution can; we will therefore make use of the latter. The following definition is given by Li (2007).
Definition 5
A Hilbertian random variable A, in a Hilbert space \({{\mathcal {H}}}\), has an elliptically symmetric distribution if the characteristic function of \(A - {{{\mathbb {E}}}{\left( {A}\right) }}\) has the following form:
\[ {{\mathbb {E}}}{\left[ \exp {\left( i{\left\langle f, A - {{\mathbb {E}}}{\left( A\right) } \right\rangle _{{{\mathcal {H}}}}}\right) }\right] } = \varphi {\left( {\left\langle f, \varPsi f \right\rangle _{{{\mathcal {H}}}}}\right) } \]
for all \(f \in {{\mathcal {H}}}\), where \(\varPsi \) is a self-adjoint, nonnegative definite, nuclear operator on \({{\mathcal {H}}}\), and \(\varphi \) is a univariate function.
We note that, in the infinite-dimensional Hilbert space setting, \(\varPsi \) in Definition 5 cannot be the identity operator, since the identity is noncompact and thus not nuclear. It can be shown that \(\varPsi \) is, up to multiplication by a constant, the covariance operator (when it exists) of the Hilbertian random variable; the requirement that the eigenvalues of \(\varPsi \) have a finite sum is then equivalent to the requirement that the variances of the principal components of A have a finite sum. We conclude that the notion of a spherically symmetric distribution cannot be extended to the entirety of an infinite-dimensional space, although sphericity can still hold on a finite-dimensional subspace.
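For example (a standard fact, not specific to this paper), a Gaussian random variable A in \({{\mathcal {H}}}\) with covariance operator \(\varSigma \) is elliptically symmetric in the sense of Definition 5:

```latex
\[
\mathbb{E}\!\left[
  \exp\!\left( i \langle f, A - \mathbb{E}(A) \rangle_{\mathcal{H}} \right)
\right]
= \exp\!\left( -\tfrac{1}{2} \langle f, \varSigma f \rangle_{\mathcal{H}} \right),
\]
```

so here \(\varPsi = \varSigma \) and \(\varphi {\left( t\right) } = e^{-t/2}\).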
Appendix B: Proofs
Lemma 1: Define \(\varPhi \) as an operator on \({{\mathcal {H}}}\) by \(\varPhi {\left( x\right) } = {{\left\langle {{g}}, {x} \right\rangle _{{{\mathcal {H}}}}}}\). This operator takes a fixed \(x \in {{\mathcal {H}}}\) and returns a real random variable, so it is a random operator. By the Riesz representation theorem, this random operator can be identified with a random element of the dual space \({{{{{\mathcal {H}}}}^{*}}}\) (that is, the space of all continuous linear functionals from \({{\mathcal {H}}}\) into the base field), so there is a unique random adjoint operator \({{{\varPhi }^{*}}}\) such that for all fixed \(x \in {{\mathcal {H}}}\) and \(y \in {{\mathbb {R}}}\), \(\varPhi {\left( x\right) }y = {{\left\langle {x}, {{{{\varPhi }^{*}}}{\left( y\right) }} \right\rangle _{{{\mathcal {H}}}}}}\). It is easy to see that, for any fixed \(y \in {{\mathbb {R}}}\), \({{{\varPhi }^{*}}}{\left( y\right) } = y{g}\). We show that, almost surely, \({{{\mathbb {E}}}{\left( {{X}}{\left. \vert \right. }{\varPhi {{\left( X\right) }}, {{g}, {{\varvec{\Gamma }}}}}\right) }}\) is orthogonal to \({{\mathrm {Span}}{\left( {{{\varvec{\Gamma }}}{g}}\right) }}^{\perp }\). For convenience, let T be the tuple \({\left( \varPhi {{\left( X\right) }}, {{g}, {{\varvec{\Gamma }}}}\right) }\). Let \(x \in {{\mathrm {Span}}{\left( {{{\varvec{\Gamma }}}{g}}\right) }}^{\perp }\), which is a random variable in \({{\mathcal {H}}}\); then we have the following:
which implies that for any fixed \(y \in {{\mathbb {R}}}\)
where the first and second equalities follow from the linearity and self-adjointness of \({{\varvec{\Gamma }}}\). The above now implies that \(\varPhi {\left( {{\varvec{\Gamma }}}x\right) } = 0\) and therefore \({{\varvec{\Gamma }}}x \in {{\mathrm {Ker}}{\left( {\varPhi }\right) }}\).
Consider now \({{{\mathbb {E}}}{\left( {{{\left\langle {x}, {{{{\mathbb {E}}}{\left( {{X}}{\left. \vert \right. }{T}\right) }}} \right\rangle _{{{\mathcal {H}}}}}}^{2}}\right) }}\). Showing this to be 0 gives the result, as it is the expectation of a squared random variable.
where the second equality follows from Eq. 5; the third and fourth equalities follow by moving the second inner product into the expectation; the fifth equality uses Eq. 5 again. Now by Assumption 4, there is a real constant A such that \({{{\mathbb {E}}}{\left( {{{\left\langle {x}, {{X}} \right\rangle _{{{\mathcal {H}}}}}}}{\left. \vert \right. }{T}\right) }} = A\varPhi {{\left( X\right) }}\).
Therefore,
\(\square \)
Lemma 2: S is an element of \({l}^{2}\) because \({{\mathcal {H}}}\) and \({l}^{2}\) are isometrically isomorphic, and by the same reasoning S is elliptically distributed. Now let \({{P}: {{l}^{2}} \rightarrow {{{\mathbb {R}}}^{n}}}\) be the operator which truncates a sequence at the \({{n}^{\mathrm {th}}}\) term. This operator is compact and therefore bounded, so by Theorem 4 of Li (2007), the vector \(T_{n}\) is elliptically distributed. \(\square \)
Theorem 5: From the definition of correlation:
Now, recall that conditional expectation is a self-adjoint operator in the covariance inner product. That is, for any real random variables \(U_{1}, U_{2}, U_{3}\), we have
\[ {{\mathrm {Cov}}{\left( {{\mathbb {E}}}{\left( U_{1} \mid U_{3}\right) }, U_{2}\right) }} = {{\mathrm {Cov}}{\left( U_{1}, {{\mathbb {E}}}{\left( U_{2} \mid U_{3}\right) }\right) }}. \]
Consider:
where the third equality follows as \({{{Y}} {}{\,\perp \!\perp \,}{{X}} {\left. \vert \right. }{{\left( {{{{{\left\langle {{g}}, {{X}} \right\rangle _{{{\mathcal {H}}}}}}}}, {g}, {{\varvec{\Gamma }}}}\right) }}}\). As Assumption 4 holds, there is a real constant \({\alpha }_{i}\) such that \({{{\mathbb {E}}}{\left( {{{{{\left\langle {{{\phi _{i}}}}, {{X}} \right\rangle _{{{\mathcal {H}}}}}}}}}{\left. \vert \right. }{{{{{{\left\langle {{g}}, {{X}} \right\rangle _{{{\mathcal {H}}}}}}}}, {g}, {{\varvec{\Gamma }}}}}\right) }} = {\alpha }_{i}{{{{\left\langle {{g}}, {{X}} \right\rangle _{{{\mathcal {H}}}}}}}}\), and similarly for j. Thus, Eq. 7 becomes:
Substituting this into Eq. 6, we find that
Thus,
As \({g}{}{\,\perp \!\perp \,}{\left( {{X}, {{\varvec{\Gamma }}}}\right) }\), \({{\mathrm {Var}}{\left( {{{{{\left\langle {{{\phi _{i}}}}, {{X}} \right\rangle _{{{\mathcal {H}}}}}}}}}{\left. \vert \right. }{{{g}, {{\varvec{\Gamma }}}}}\right) }}= {{\mathrm {Var}}{\left( {{{{{\left\langle {{{\phi _{i}}}}, {{X}} \right\rangle _{{{\mathcal {H}}}}}}}}}{\left. \vert \right. }{{{\varvec{\Gamma }}}}\right) }} = {{\lambda _{i}}}\) and similarly for j. Thus,
Now look back at Eq. 7. By Eq. 5, we see that
By Lemma 1, \({{{\mathbb {E}}}{\left( {{X}}{\left. \vert \right. }{{{{{{\left\langle {{g}}, {{X}} \right\rangle _{{{\mathcal {H}}}}}}}}, {g}, {{\varvec{\Gamma }}}}}\right) }} = c{{\varvec{\Gamma }}}{g}\) for some constant c. Hence,
Now we have
Consequently,
and similarly for \({\alpha }_{j}\). So Eq. 8 can be rewritten as
Now by Assumption 5, \({{\left\{ {{{{\left\langle {{{\phi _{k}}}}, {{g}} \right\rangle _{{{\mathcal {H}}}}}}}}\right\} }}_{k \in {{{{\mathbb {N}}}\cap \left[ 1, {n}\right] }}}\) is spherically symmetric for any n. Therefore, by Theorem 1 of Arnold and Brockett (1992), \(\frac{{{{\left\langle {{{\phi _{i}}}}, {{g}} \right\rangle _{{{\mathcal {H}}}}}}}}{{{{\left\langle {{{\phi _{j}}}}, {{g}} \right\rangle _{{{\mathcal {H}}}}}}}}\) has a standard Cauchy distribution. Thus,
\(\square \)
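The Cauchy-ratio step in the proof above can be checked numerically. The sketch below (illustrative only; it uses a Gaussian pair, which is spherically symmetric) compares the empirical CDF of a coordinate ratio with the standard Cauchy CDF \(\tfrac{1}{2} + \arctan {\left( x\right) }/\pi \):

```python
import numpy as np

rng = np.random.default_rng(2)
z = rng.standard_normal((100_000, 2))  # spherically symmetric (Gaussian) pairs
ratio = z[:, 0] / z[:, 1]              # should be standard Cauchy

# Compare empirical and theoretical CDFs at a few points.
grid = np.array([-2.0, 0.0, 1.0, 5.0])
empirical = np.array([np.mean(ratio <= x) for x in grid])
cauchy_cdf = 0.5 + np.arctan(grid) / np.pi
max_err = np.max(np.abs(empirical - cauchy_cdf))
print(max_err)  # small deviation, within Monte Carlo error
```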
Theorem 6: The proof is similar to that of Theorem 5 up to the point where we have shown that:
Now as \({g}\) has an elliptical distribution, we apply Lemma 2 and Theorem 2 of Arnold and Brockett (1992) to find that \({{{\left\langle {{{\phi _{i}}}}, {{g}} \right\rangle _{{{\mathcal {H}}}}}}}/{{{\left\langle {{{\phi _{j}}}}, {{g}} \right\rangle _{{{\mathcal {H}}}}}}}\) has a general Cauchy distribution with scale parameter \({\gamma _{ij}}\) and location \({\kappa _{ij}}\). Thus,
Using \(\arctan {\left( -x\right) } = - \arctan {\left( x\right) } \), we have that the above is equal to:
Using \(\arctan {\left( u\right) } + \arctan {\left( v\right) } = \arctan {\left( \frac{u+v}{1-uv}\right) }\), which holds modulo \(\pi \) provided \(uv \ne 1\), the above probability becomes:
We see that the numerator is equal to \({d_{ij, {1}}}\) and the denominator equal to \({d_{ij, {2}}}\). Then, using \(\arctan {\left( x\right) } = 2 \arctan {\left( \frac{x}{1+\sqrt{1 + x^{2}}}\right) }\), we can rewrite the above and simplify to obtain:
\(\square \)
Jones, B., Artemiou, A. On principal components regression with Hilbertian predictors. Ann Inst Stat Math 72, 627–644 (2020). https://doi.org/10.1007/s10463-018-0702-9