Functional Principal Components Analysis of Spatially Correlated Data

This paper focuses on the analysis of spatially correlated functional data. The between-curve correlation is modeled by correlating the functional principal component scores of the functional data. We propose a Spatial Principal Analysis by Conditional Expectation (SPACE) framework to explicitly estimate spatial correlations and reconstruct individual curves. This approach works even when the observed data per curve are sparse. Assuming spatial stationarity, empirical spatial correlations are calculated as the ratios of eigenvalues of the smoothed covariance surface $Cov(X_i(s),X_i(t))$ and cross-covariance surface $Cov(X_i(s), X_j(t))$ at locations indexed by $i$ and $j$. An anisotropic Mat\'ern spatial correlation model is then fit to the empirical correlations. Finally, principal component scores are estimated to reconstruct the sparsely observed curves. This framework can naturally accommodate arbitrary covariance structures, but there is an enormous reduction in computation if one can assume the separability of temporal and spatial components. We propose hypothesis tests to examine the separability as well as the isotropy of the spatial correlation. Simulation studies and applications to empirical data show improvements in curve reconstruction using our framework over methods that assume independent curves. In addition, we show that the asymptotic properties of estimates in the uncorrelated case still hold in our case under 'mild' spatial correlation.


Introduction
Functional data analysis (FDA) focuses on data that are infinite-dimensional, such as curves, shapes and images. Generically, functional data are measured over a continuum across multiple subjects. In practice, many kinds of data, such as growth curves of different people, gene expression profiles, vegetation indices across multiple locations, and vertical profiles of atmospheric radiation recorded at different times, can naturally be modeled within the FDA framework.
Functional data are usually modeled as noise-corrupted observations from a collection of trajectories that are assumed to be realizations of a smooth random function of time $X(t)$, with unknown mean shape $\mu(t)$ and covariance function $Cov(X(s), X(t)) = G(s, t)$. The functional principal components (fPCs), which are the eigenfunctions of the kernel $G(s, t)$, provide a comprehensive basis for representing the data and hence are very useful in problems related to model building and prediction of functional data.
Let $\phi_k(t)$ and $\lambda_k$, $k = 1, 2, \cdots, K$, be the first $K$ eigenfunctions and eigenvalues of $G(s, t)$. Then
$$X_i(t) = \mu(t) + \sum_{k=1}^{K} \xi_{ik}\phi_k(t),$$
where the $\xi_{ik}$ are fPC scores which have mean zero and variance $\lambda_k$. According to this model, all curves share the same modes of variation, $\phi_k(t)$, around the common mean process $\mu(t)$.
A majority of previous work in FDA assumes that the realizations of the underlying smooth random function are independent. There exists an extensive literature on functional principal components analysis (fPCA) for this case. For data observed at irregular grids, Yao et al. (2003) and Yao and Lee (2006) used a local linear smoother to estimate the covariance kernel and an integration method to compute fPC scores. However, the integration is approximated poorly with sparse data. James and Sugar (2003) proposed B-splines to model the individual curves through a mixed effects model in which fPC scores are treated as random effects. For sparsely observed data, Yao et al. (2005) proposed a framework called "PACE", which stands for Principal Analysis by Conditional Expectation. In PACE, fPC scores are estimated by their expectation conditional on the available observations across all trajectories. To estimate the fPCs, a system of orthogonal functions, Peng and Paul (2009) proposed a restricted maximum likelihood method based on a Newton-Raphson procedure on the Stiefel manifold. Hall et al. (2006) and Li and Hsing (2010) gave weak and strong uniform convergence rates for the local linear smoothers of the mean and covariance, and rates for the derived fPC estimates.
The PACE approach works by efficiently extracting information on $\phi_k(t)$ and $\mu(t)$ even when only a few observations are made on each curve, as long as the pooled time points from all curves are sufficiently dense. Nevertheless, PACE is limited by its assumption of independent curves. In reality, observations from different subjects may be correlated. For example, it is expected that expression profiles of genes involved in the same biological processes are correlated, and vegetation indices of the same land cover class at neighboring locations are likely to be correlated.
There has been some recent work on correlated functional data. Li et al. (2007) proposed a kernel-based nonparametric method to estimate correlation among functions where observations are sampled at regular temporal grids and smoothing is performed across different spatial distances. Moreover, it was assumed in their work that the covariance between two observations can be factored as the product of a temporal covariance and a spatial correlation, which is referred to as a separable covariance. Paul and Peng (2010) discussed a nonparametric method similar to PACE to estimate fPCs and proved that the $L_2$ risk of their estimator achieves the optimal nonparametric rate under a mild correlation regime when the number of observations per curve is bounded. Zhou et al. (2010) presented a mixed effects model to estimate the correlation structure, which accommodates both separable and non-separable structures.
In this paper, we develop a new framework which we call SPACE (Spatial PACE) for modeling correlated functional data. In SPACE, we explicitly model the spatial correlation among curves and extend the local linear smoothing techniques in PACE to the case of correlated functional data. Our method differs from Li et al. (2007) in that sparsely and irregularly observed data can be modeled and it is not necessary to assume a separable correlation structure.
In fact, based on our SPACE framework, we propose hypothesis tests to examine whether or not the correlation structure presented by the data is separable. Specifically, we model the correlation of fPC scores $\xi_{ik}$ across curves by the anisotropic Matérn family. In the anisotropic Matérn correlation model (Haskard, 2007), we rotate and stretch the axes such that the equal-correlation contour is a tilted ellipse, to accommodate the anisotropy that often arises in geoscience data. In our model, the anisotropic Matérn correlation is governed by four parameters $\alpha, \delta, \zeta, \nu$, where $\alpha$ controls the axis rotation angle and $\delta$ specifies the amount of axis stretch. SPACE identifies a list of neighborhood structures and applies a local linear smoother to estimate a cross-covariance surface for each spatial separation vector. An example of a neighborhood structure could be all pairs of locations which are separated by a distance of one unit and are positioned from southwest to northeast.
In particular, SPACE estimates each cross-covariance surface by smoothing the empirical covariances observed at those locations. Next, empirical spatial correlations are estimated based on the eigenvalues of those cross-covariance surfaces. Then, the anisotropic Matérn parameters are estimated from the empirical spatial correlations. SPACE plugs the fitted spatial correlation model directly into curve reconstruction to improve reconstruction performance relative to PACE, where no spatial correlation is modeled.
We demonstrate the SPACE methodology using simulated functional data and the Harvard Forest vegetation index data discussed in Liu et al. (2012). In simulation studies, we first examine the estimation of the SPACE model components. Then we perform the hypothesis tests of separability and isotropy. We show that curve reconstruction performance is improved using SPACE over PACE. The hypothesis tests also demonstrate reasonable sizes and powers.
Moreover, we construct semi-empirical data by randomly removing observations to achieve sparseness in the vegetation index at Harvard Forest. We then show that SPACE restores vegetation index trajectories with smaller errors than PACE.
The rest of the paper is organized as follows. Section 2 describes the spatially correlated functional data model. Section 3 describes the SPACE framework and the model selection associated with it. We then summarize the consistency results for SPACE estimates in Section 4 and defer more detailed discussion to Appendix A. Next, we propose hypothesis tests based on the SPACE model in Section 5. Section 6 describes simulation studies on model estimation, followed by Section 7, which presents curve reconstruction analysis on the Harvard Forest data. Finally, conclusions and comments are given in Section 8.

Correlated Functional Data Model
In this section, we describe how we incorporate spatial correlation into functional data and introduce the Matérn class which we use to model spatial correlation.

Data Generating Process
We start by assuming that data are collected across $N$ spatial locations. For location $i$, $n_i$ noise-corrupted points are sampled from a random trajectory $X_i(t)$, denoted by $Y_i(t_{ij})$, $j = 1, 2, \cdots, n_i$. These observations can be expressed by the following additive error model,
$$Y_i(t_{ij}) = X_i(t_{ij}) + \epsilon_{ij}, \quad (2.1)$$
where the errors $\epsilon_{ij}$ are assumed to be iid with mean zero and variance $\sigma^2$ across locations and sampling times. The random function $X_i(t)$ is the $i$th realization of an underlying random function $X(t)$ which is assumed to be smooth and square integrable on a bounded and closed time interval $T$. Note that we refer to the argument of the function as time without loss of generality. The mean and covariance functions of $X(t)$ are unknown and denoted by $\mu(t) = E(X(t))$ and $G(s, t) = Cov(X(s), X(t))$. By the Karhunen-Loève theorem, under suitable regularity conditions, there exists an eigen-decomposition of the covariance kernel,
$$G(s, t) = \sum_{k=1}^{\infty} \lambda_k \phi_k(s)\phi_k(t), \quad (2.2)$$
where $\{\phi_k(t)\}_{k=1}^{\infty}$ are orthogonal functions in the $L_2$ sense which we also call functional principal components (fPCs), and $\{\lambda_k\}_{k=1}^{\infty}$ are the associated non-increasing eigenvalues. Then, each realization $X_i(t)$ has the following expansion,
$$X_i(t) = \mu(t) + \sum_{k=1}^{\infty} \xi_{ik}\phi_k(t), \quad (2.3)$$
where for given $i$, the $\xi_{ik}$'s are uncorrelated fPC scores with variance $\lambda_k$. Usually, a finite number of eigenfunctions is chosen to achieve a reasonable approximation,
$$X_i(t) \approx \mu(t) + \sum_{k=1}^{K} \xi_{ik}\phi_k(t). \quad (2.4)$$
In the classical functional data model, the $X_i(t)$'s are independent across $i$ and thus $cor(\xi_{ik}, \xi_{jk}) = 0$ for any pair of different curves $\{i, j\}$ and for any given fPC index $k$. However, in many applications, explicit modeling and estimation of the spatial correlation is desired and can provide insights for subsequent analysis. To build in correlation among curves, we assume the $\xi_{ik}$'s are correlated across $i$ for each $k$. One could specify a full correlation structure among the $\xi_{ik}$'s by allowing non-zero covariance between scores of different fPCs, e.g. $Cov(\xi_{ip}, \xi_{jq}) \neq 0$ for $p \neq q$.
Though the full structure is very flexible, it is subject to the risk of overfitting and thus its estimation can be intractable. To achieve parsimony, we assume the following,
$$Cov(\xi_{ik}, \xi_{jk'}) = \rho_{ij}(k)\,\lambda_k\,1\{k = k'\}, \quad (2.5)$$
where $\rho_{ij}(k)$ measures the correlation between the $k$th fPC scores at curves $i$ and $j$. Denoting $\phi(t) = (\phi_1(t), \cdots, \phi_K(t))^T$ and retaining the first $K$ eigenfunctions as in (2.4), the covariance between $X_i(s)$ and $X_j(t)$ can be expressed as
$$Cov(X_i(s), X_j(t)) = \phi(s)^T\,Cov(\xi_i, \xi_j)\,\phi(t), \qquad Cov(\xi_i, \xi_j) = \mathrm{diag}(\rho_{ij}(1)\lambda_1, \cdots, \rho_{ij}(K)\lambda_K). \quad (2.6)$$
Note the diagonal elements are not necessarily sorted by value. If we further assume the between-curve correlation $\rho_{ij}(k)$ does not depend on $k$, i.e. $\rho_{ij}(k) = \rho_{ij}$, then $Cov(\xi_i, \xi_j) = \rho_{ij}\,\mathrm{diag}(\lambda_1, \cdots, \lambda_K)$ (2.7). In this case the covariance can be further simplified as
$$Cov(X_i(s), X_j(t)) = \rho_{ij}\sum_{k=1}^{K}\lambda_k\phi_k(s)\phi_k(t) = \rho_{ij}\,G(s, t). \quad (2.8)$$
If the covariance between $X_i(s)$ and $X_j(t)$ can be decomposed into a product of spatial and temporal components as in (2.8), we refer to this covariance structure as separable. A separable covariance structure of the noiseless processes assumes that the correlation across curves and the correlation across time are independent of each other. One example of this type of process is the weed growth data studied in Banerjee and Johnson (2006), where curves are weed growth profiles at different locations in an agricultural field. Separable covariance is a convenient assumption which makes estimation easier. It is, however, a restrictive assumption, and in order to examine whether it is justifiable, we propose a hypothesis test which is described in Section 5.
Irrespective of the separability of covariance, spatial correlations among curves are reflected through the correlation structure of fPC scores ξ. For each fPC index k, the associated fPC scores at different locations can be viewed as a spatial random process.
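The data generating process above can be sketched in code. The following is a minimal simulation of model (2.1)-(2.5), assuming a one-dimensional set of locations, two fPCs, and an exponential (Matérn with $\nu = 1/2$) spatial correlation of the scores; all numerical settings and names are illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Grid of N locations on a line; n_obs time points per curve.
N, K, n_obs, sigma = 30, 2, 15, 0.1
coords = np.arange(N, dtype=float)          # 1-D location coordinates
tgrid = np.linspace(0.0, 1.0, n_obs)        # observation times

mu = lambda t: np.sin(2 * np.pi * t)        # hypothetical mean function
# First two fPCs (orthonormal on [0, 1]) and their eigenvalues.
phis = [lambda t: np.sqrt(2) * np.sin(2 * np.pi * t),
        lambda t: np.sqrt(2) * np.cos(2 * np.pi * t)]
lams = np.array([1.0, 0.5])

# Spatial correlation of scores: Matérn with nu = 1/2, i.e. exp(-d / zeta).
zeta = 2.0
D = np.abs(coords[:, None] - coords[None, :])
rho = np.exp(-D / zeta)                     # N x N spatial correlation matrix

# Draw correlated fPC scores: for each k, xi_k ~ N(0, lambda_k * rho).
L = np.linalg.cholesky(rho + 1e-10 * np.eye(N))
xi = np.stack([np.sqrt(lams[k]) * L @ rng.standard_normal(N)
               for k in range(K)], axis=1)  # N x K score matrix

# Assemble noisy observations per (2.1): Y_i(t) = mu(t) + sum_k xi_ik phi_k(t) + eps.
Y = (mu(tgrid)[None, :]
     + sum(xi[:, [k]] * phis[k](tgrid)[None, :] for k in range(K))
     + sigma * rng.standard_normal((N, n_obs)))
print(Y.shape)  # (30, 15)
```

Each row of `Y` plays the role of one sparsely observed curve; subsampling its columns would mimic the sparse designs considered later.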

Matérn Class
We choose the Matérn class to model spatial correlation. It is widely used as a parametric model in geoscience, and the Matérn family is attractive due to its flexibility. Specifically, the Matérn correlation between two observations at locations separated by distance $d > 0$ is given by
$$\rho(d; \zeta, \nu) = \frac{2^{1-\nu}}{\Gamma(\nu)}\left(\frac{d}{\zeta}\right)^{\nu} K_{\nu}\!\left(\frac{d}{\zeta}\right),$$
where $K_{\nu}(\cdot)$ is the modified Bessel function of the third kind of order $\nu > 0$ described in Abramowitz and Stegun (1970). This class is governed by two parameters: a range parameter $\zeta > 0$ which rescales the distance argument, and a smoothness parameter $\nu > 0$ which controls the smoothness of the underlying process.
The Matérn class is by itself isotropic: its contours of constant correlation are circular in two-dimensional applications. However, isotropy is a strong assumption and thus limits model flexibility in some applications. Liu et al. (2012) showed that fPC scores can present directional patterns which might be associated with geographical features. Specifically, the authors looked at vegetation index series at Harvard Forest in Massachusetts across a 25 × 25 grid over 6 years. Using the same data, we calculate the correlation between fPC scores at locations separated by 45 degrees to the northeast and to the northwest respectively. Anisotropy can be easily incorporated into the Matérn class by applying a linear transformation to the spatial coordinates. To this end, two additional parameters are required: an anisotropy angle $\alpha$ which determines how much the axes rotate clockwise, and an anisotropy ratio $\delta$ specifying how much one axis is stretched or shrunk relative to the other. Let $(x_1, y_1)$ and $(x_2, y_2)$ be the coordinates of two locations and denote the spatial separation vector between these two locations by $\Delta = (x_1 - x_2, y_1 - y_2)^T$. The transformed separation vector is $SR\Delta$, where we write the rotation and rescaling matrices as
$$R = \begin{pmatrix} \cos\alpha & \sin\alpha \\ -\sin\alpha & \cos\alpha \end{pmatrix}, \qquad S = \begin{pmatrix} \delta^{-1/2} & 0 \\ 0 & \delta^{1/2} \end{pmatrix}.$$
Define the non-Euclidean distance function as $d^*(\Delta, \alpha, \delta) = \|SR\Delta\|$. Hence, the anisotropic correlation is computed as
$$\rho^*(\Delta) = \rho(d^*(\Delta, \alpha, \delta); \zeta, \nu). \quad (2.11)$$
Note $\rho^*(\Delta) = \rho^*(-\Delta)$. Let $\Delta_{ij}$ be the spatial separation vector of the locations of curves $i$ and $j$. Then we model the covariance structure of fPC scores described in (2.5) as
$$Cov(\xi_{ik}, \xi_{jk}) = \lambda_k\,\rho^*(\Delta_{ij}; \alpha_k, \delta_k, \zeta_k, \nu_k). \quad (2.12)$$
Note that the above parametrization is not identifiable. Firstly, $\alpha$ and $\alpha + \pi$ always give the same correlation. Secondly, any pair $(\alpha, \delta)$ gives exactly the same correlation as the combination $(\alpha + \pi/2, 1/\delta)$. We remove this non-uniqueness by adding constraints on the ranges of the parameters.
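The anisotropic Matérn correlation can be sketched as follows, using `scipy.special.kv` for the modified Bessel function. The symmetric rescaling $\mathrm{diag}(\delta^{-1/2}, \delta^{1/2})$ is one convention consistent with the $(\alpha + \pi/2, 1/\delta)$ non-identifiability noted above; the function and parameter names are ours.

```python
import numpy as np
from scipy.special import gamma, kv

def matern(d, zeta, nu):
    """Isotropic Matérn correlation with range zeta > 0 and smoothness nu > 0."""
    d = np.atleast_1d(np.asarray(d, dtype=float))
    out = np.ones_like(d)                     # correlation 1 at distance 0
    nz = d > 0
    u = d[nz] / zeta
    out[nz] = (2.0 ** (1.0 - nu) / gamma(nu)) * u ** nu * kv(nu, u)
    return out

def aniso_matern(delta_vec, alpha, delta, zeta, nu):
    """Anisotropic Matérn: rotate the separation vector by alpha, rescale the
    rotated axes with ratio delta, then evaluate the isotropic Matérn at the
    resulting non-Euclidean distance d*."""
    R = np.array([[np.cos(alpha), np.sin(alpha)],
                  [-np.sin(alpha), np.cos(alpha)]])
    S = np.diag([delta ** -0.5, delta ** 0.5])
    d_star = np.linalg.norm(S @ R @ np.asarray(delta_vec, dtype=float))
    return float(matern(d_star, zeta, nu)[0])

# The non-identifiable parameter pairs give identical correlations:
r1 = aniso_matern((1.0, 2.0), 0.3, 2.0, 1.5, 0.5)
r2 = aniso_matern((1.0, 2.0), 0.3 + np.pi / 2, 0.5, 1.5, 0.5)   # (alpha+pi/2, 1/delta)
r3 = aniso_matern((-1.0, -2.0), 0.3, 2.0, 1.5, 0.5)             # rho*(-Delta)
assert abs(r1 - r2) < 1e-12 and abs(r1 - r3) < 1e-12
```

For $\nu = 1/2$ the isotropic form reduces to the exponential correlation $\exp(-d/\zeta)$, which is a convenient special case for quick checks.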

SPACE Methodology
The SPACE methodology extends the PACE methodology introduced by Yao et al. (2005) and the methodology discussed in Li et al. (2007). Among the components of SPACE, the mean function $\mu(t)$ and the measurement error variance $\sigma^2$ in (2.1) are estimated by the same methods used in Yao et al. (2005). In this section, we introduce the estimation of the other key components of SPACE: the cross-covariance surface $Cov(X_i(s), X_j(t))$ for any location pair $(i, j)$ in (2.6), and the anisotropic Matérn model among fPC scores in (2.12). We also describe methods for curve reconstruction and model selection.

Cross-Covariance Surface
Let us define $G_{ij}(s, t) = Cov(X_i(s), X_j(t))$ as the cross-covariance surface between locations $i$ and $j$. Let $\hat\mu(t)$ be the estimated mean function, and let $D_{ij}(t_{ik}, t_{jl}) = (Y_i(t_{ik}) - \hat\mu(t_{ik}))(Y_j(t_{jl}) - \hat\mu(t_{jl}))$ be the raw cross-covariance based on observations from curve $i$ at time $t_{ik}$ and curve $j$ at time $t_{jl}$. We estimate $G_{ij}(s, t)$ by smoothing the $D_{ij}(t_{ik}, t_{jl})$ using a local linear smoother.
In this work, we assume second order spatial stationarity of the fPC score process. Define
$$N(\Delta) = \{(i, j) : \text{the spatial separation vector of locations } i \text{ and } j \text{ is } \Delta\}$$
to be the collection of location pairs that share the same spatial correlation structure. Then, all location pairs that belong to $N(\Delta)$ are associated with the same unique covariance surface, which we write as $G_\Delta(s, t)$. As a result, all raw covariances constructed from locations in $N(\Delta)$ can be pooled together to estimate $G_\Delta(s, t)$. In addition, we note that
$$Cov(Y_i(s), Y_j(t)) = G_{ij}(s, t) + \sigma^2\,1\{i = j, s = t\},$$
where the indicator is 1 if $i = j$ and $s = t$, and 0 otherwise. If $i = j$, the problem reduces to the estimation of the covariance surface, and we apply the same treatment described in Yao et al. (2005) to deal with the extra $\sigma^2$ on the diagonal. For $i \neq j$ and a given spatial separation vector $\Delta$, the local linear smoother of the cross-covariance surface $G_\Delta(s, t)$ is derived by minimizing
$$\sum_{(i,j) \in N(\Delta)}\sum_{k=1}^{n_i}\sum_{l=1}^{n_j} \kappa_2\!\left(\frac{t_{ik} - s}{h}, \frac{t_{jl} - t}{h}\right)\left\{D_{ij}(t_{ik}, t_{jl}) - \beta_0 - \beta_1(t_{ik} - s) - \beta_2(t_{jl} - t)\right\}^2 \quad (3.2)$$
with respect to $\beta_0$, $\beta_1$ and $\beta_2$, where $\kappa_2$ is the two-dimensional Gaussian kernel. Let $\hat\beta_0$, $\hat\beta_1$ and $\hat\beta_2$ be the minimizers of (3.2). Then $\hat{G}_\Delta(s, t) = \hat\beta_0$. For computation, we evaluate one-dimensional functions over equally-spaced time points $t^{eval} = (t^{eval}_1, \cdots, t^{eval}_M)^T$ with step size $h$ between two consecutive points. We evaluate $\hat{G}_\Delta(s, t)$ over all possible two-dimensional grid points constructed from $t^{eval}$, denoted by $t^{eval} \times t^{eval}$. Let $\hat{G}_\Delta(t^{eval} \times t^{eval})$ be the evaluation matrix across all grid points. The estimates of the eigenfunctions $\phi_k(t)$ and eigenvalues $\lambda_k$ are derived as the eigenvectors and eigenvalues of $\hat{G}_\Delta(t^{eval} \times t^{eval})$, adjusted for the step size $h$.
In some cases the number of elements in $N(\Delta)$ is very limited. For example, if observations are collected at irregular and sparse locations, it is relatively rare for two pairs of locations to have exactly the same spatial separation vector. More location pairs can be included by working with a sufficiently small neighborhood around a given $\Delta$. Define $N(\Delta, \epsilon) = \{(i, j) : \|\Delta_{ij} - \Delta\| \leq \epsilon\}$ to be the set of location pairs whose separation vectors lie within $\epsilon$ of $\Delta$.
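A minimal sketch of the local linear smoother in (3.2), assuming the raw cross-covariances for a neighborhood $N(\Delta)$ have already been pooled into flat arrays; the function name and interface are ours.

```python
import numpy as np

def local_linear_2d(s_obs, t_obs, d_obs, s_eval, t_eval, h):
    """Local linear smoother of raw cross-covariances d_obs observed at time
    pairs (s_obs, t_obs), evaluated on the grid s_eval x t_eval.  At each grid
    point it solves the weighted least squares problem in (3.2) with a
    two-dimensional Gaussian kernel of bandwidth h."""
    G = np.empty((len(s_eval), len(t_eval)))
    for a, s in enumerate(s_eval):
        for b, t in enumerate(t_eval):
            u, v = (s_obs - s) / h, (t_obs - t) / h
            w = np.exp(-0.5 * (u ** 2 + v ** 2))       # Gaussian kernel weights
            X = np.column_stack([np.ones_like(u), s_obs - s, t_obs - t])
            XtW = X.T * w
            beta = np.linalg.solve(XtW @ X, XtW @ d_obs)
            G[a, b] = beta[0]                          # beta_0 is the fit at (s, t)
    return G
```

In SPACE the inputs would be the pooled raw cross-covariances $D_{ij}(t_{ik}, t_{jl})$ for all $(i, j) \in N(\Delta)$; the returned matrix plays the role of $\hat{G}_\Delta(t^{eval} \times t^{eval})$.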

Anisotropy Matérn Model
Now we focus on estimating the parameters of the Matérn model. We first estimate the empirical correlations $\rho_k(\Delta)$ from the cross-covariance surfaces and then estimate the parameters of the Matérn model by fitting it to the empirical correlations. Equation (2.12) specifies the spatial covariance among fPC scores. Let $\lambda_k(\Delta)$ be the $k$th eigenvalue of $G_\Delta(s, t)$. For the covariance surface, we use $\lambda_k((0, 0))$ and $\lambda_k$ interchangeably. Then we have
$$\lambda_k(\Delta) = \rho^*_k(\Delta)\,\lambda_k.$$
Thus for all $\Delta \neq (0, 0)$, $\rho^*_k(\Delta)$ can be estimated as the ratio of the $k$th eigenvalues of $\hat{G}_\Delta(s, t)$ and $\hat{G}_{(0,0)}(s, t)$, which can be written as
$$\hat\rho^*_k(\Delta) = \frac{\hat\lambda_k(\Delta)}{\hat\lambda_k((0, 0))}.$$
The pairs $\{\Delta_l, \hat\rho^*_k(\Delta_l)\}_{l=1}^{L}$ are used to fit (2.11) and to estimate the parameters $\alpha, \delta, \zeta, \nu$. If a separable covariance structure is assumed, empirical correlations can be pooled across $k$ to estimate the parameters of the common spatial correlation model. When fitting (2.11), the sum of squared differences between the empirical and fitted correlations over all $\Delta_l$'s is minimized through numerical optimization.
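The eigenvalue-ratio estimate and the least-squares fit can be sketched as below. For brevity the sketch fixes $\nu = 1/2$, for which the Matérn correlation reduces to $\exp(-d/\zeta)$, and fits only the range parameter $\zeta$; the empirical eigenvalues are hypothetical numbers, not values from the paper.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical empirical quantities: lambda_k(Delta_l) for three neighborhoods
# and lambda_k((0,0)) from the covariance surface.
seps = np.array([1.0, 2.0, 3.0])              # |Delta_l| for three neighborhoods
lam0 = 1.2                                    # k-th eigenvalue of G_(0,0)
lam_delta = np.array([0.73, 0.44, 0.27])      # k-th eigenvalues of G_Delta
rho_hat = lam_delta / lam0                    # empirical spatial correlations

# Fit the range parameter zeta of a Matérn(nu = 1/2) model, i.e. exp(-d/zeta),
# by minimizing the sum of squared differences to the empirical correlations.
sse = lambda zeta: np.sum((rho_hat - np.exp(-seps / zeta)) ** 2)
res = minimize_scalar(sse, bounds=(1e-3, 50.0), method="bounded")
zeta_hat = res.x
```

The full SPACE fit would optimize over all four parameters $(\alpha, \delta, \zeta, \nu)$ of the anisotropic model in the same least-squares sense.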

Curve Reconstruction
Reconstructing trajectories is an important application of the SPACE model. Curve reconstruction based on the SPACE model also provides an alternative perspective on "gap-filling" missing data in geoscience applications. Equation (2.4) specifies the underlying formula used to reconstruct the trajectory $X_i(t)$ for each $i$. $\{\hat\phi_k(t)\}_{k=1}^{K}$ and $\hat\mu(t)$ can be derived through the process described in the previous sections. The only missing elements now are the fPC scores $\xi_{ik}$. The best linear unbiased predictors (BLUP) (Henderson, 1950) of the $\xi_{ik}$ are given by
$$\hat\xi_{ik} = E(\xi_{ik} \mid Y) = Cov(\xi_{ik}, Y)\,\Sigma(Y)^{-1}(Y - \mu), \quad (3.6)$$
where $Y$ denotes the pooled vector of observations across all curves. To describe the closed-form solution to equation (3.6) and to facilitate subsequent discussions, we introduce the following notation.
Let $\xi$ be the stacked vector of fPC scores across curves, $Y$ the stacked vector of observations, $\mu$ the corresponding stacked mean vector, $\Phi$ the matrix of eigenfunctions evaluated at the pooled observation times, and $\Lambda = \mathrm{diag}(\lambda_1, \cdots, \lambda_K)$. Note that diagonalization and transpose are performed before substitution in all of the above notation. If assuming separable covariance, we write $\rho = (\rho^*_{ij})$ for the $N \times N$ matrix of between-curve correlations. In addition, define $\Sigma(A, B)$ as the covariance matrix between $A$ and $B$, and $\Sigma(A)$ as the variance matrix of $A$. Then
$$\Sigma(\xi) = \rho \otimes \Lambda,$$
where $\otimes$ represents the Kronecker product. With Gaussian assumptions, we have
$$\hat\xi = \Sigma(\xi, Y)\,\Sigma(Y)^{-1}(Y - \mu) = \Sigma(\xi)\Phi^T\left(\Phi\Sigma(\xi)\Phi^T + \sigma^2 I\right)^{-1}(Y - \mu) = \left(\Sigma(\xi)^{-1} + \sigma^{-2}\Phi^T\Phi\right)^{-1}\sigma^{-2}\Phi^T(Y - \mu), \quad (3.8)$$
where the last expression follows from the Woodbury matrix identity. For cases where $\sum_{i=1}^{N} n_i > NK$, this transformation suggests a way to reduce the size of the matrix to be inverted.
The separability assumption simplifies not only the model itself but also the calculation of the matrix inverse, noting that $\Sigma(\xi)^{-1} = (\rho \otimes \Lambda)^{-1} = \rho^{-1} \otimes \Lambda^{-1}$. By substituting all elements in (3.8) with their corresponding estimates, the estimate $\hat\xi$ of $\xi$ is derived. The reconstructed trajectory is then given by
$$\hat{X}_i(t) = \hat\mu(t) + \sum_{k=1}^{K}\hat\xi_{ik}\hat\phi_k(t).$$
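Under separability, the score BLUP can be sketched with Kronecker products as follows, assuming for simplicity that all curves share a common dense time grid (so the design matrix is $I_N \otimes \Phi$); all numerical settings are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
N, K, n, sigma = 20, 2, 10, 0.2
t = np.linspace(0, 1, n)
Phi = np.column_stack([np.sqrt(2) * np.sin(2 * np.pi * t),
                       np.sqrt(2) * np.cos(2 * np.pi * t)])   # n x K eigenfunctions
lam = np.array([1.0, 0.4])
D = np.abs(np.subtract.outer(np.arange(N), np.arange(N))).astype(float)
rho = np.exp(-D / 2.0)                                        # fitted spatial correlation

# Simulate centered curves Y (N x n) from the separable model.
Lc = np.linalg.cholesky(rho + 1e-10 * np.eye(N))
xi = (Lc @ rng.standard_normal((N, K))) * np.sqrt(lam)        # true scores, N x K
Y = xi @ Phi.T + sigma * rng.standard_normal((N, n))

# BLUP of the stacked scores: xi_hat = Sigma(xi) B^T (B Sigma(xi) B^T + sigma^2 I)^{-1} y,
# with Sigma(xi) = rho (x) Lambda and B = I_N (x) Phi on a common time grid.
Sig_xi = np.kron(rho, np.diag(lam))                           # NK x NK
B = np.kron(np.eye(N), Phi)                                   # Nn x NK
y = Y.reshape(-1)                                             # stack curves row-wise
Sig_y = B @ Sig_xi @ B.T + sigma ** 2 * np.eye(N * n)
xi_hat = (Sig_xi @ B.T @ np.linalg.solve(Sig_y, y)).reshape(N, K)
X_hat = xi_hat @ Phi.T                                        # reconstructed centered curves
```

For sparse, curve-specific designs the rows of $B$ would instead hold the eigenfunctions evaluated at each curve's own observation times, and the Woodbury form in (3.8) becomes the cheaper computation when the total number of observations exceeds $NK$.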

Model Selection
Rice and Silverman (1991) proposed selecting smoothing parameters by leave-one-curve-out cross-validation. With spatially correlated curves, however, a left-out curve remains correlated with its neighbors in the training set, which biases the cross-validation score down. The leave-one-block-out (LOBO) method achieves a better balance between these two competing forces. We leave a more detailed investigation of this phenomenon to future work.
To determine the number of eigenfunctions $K$ which sufficiently approximates the infinite-dimensional process, we rely on $J$-fold cross-validation of the curve reconstruction error, which we denote by $ErrCV_J$ for convenience. Specifically, all curves are partitioned into $J$ complementary groups.
Curves in each group serve as the testing sample. We then select the training sample as the curves that are a certain distance away from the testing curves, to reduce the spatial correlation between the training and testing samples. For each fold, testing curves are reconstructed based on parameters estimated from the corresponding training curves. Denote the reconstructed curves of the $j$th fold by $\hat{X}^{-j}$. Then, we compute $ErrCV_J$ as
$$ErrCV_J = \sum_{j=1}^{J}\sum_{i \in \text{fold } j}\sum_{l=1}^{n_i}\left(Y_i(t_{il}) - \hat{X}^{-j}_i(t_{il})\right)^2. \quad (3.13)$$
Zhou et al. (2010) pointed out that cross-validation scores may keep decreasing as model complexity increases, which is also observed in our simulation studies. The 5-fold cross-validation score, $ErrCV_5$, as a function of $K$ is illustrated in Figure 2. A quick drop is observed, followed by a much slower decrease. Instead of finding the minimum of $ErrCV_5$, we visually select a suitable $K$ at the kink of the $ErrCV_5$ profile. In Figure 2, the largest drop of the CV score takes place at the correct value $K = 2$ in 100% of the 200 replicated data sets. When estimating model parameters, we use two spatial separation vectors $\Delta_1 = (1, 0)$ and $\Delta_2 = (2, 0)$.
Figure 2: $ErrCV_5$ computed over 200 simulated data sets with number of eigenfunctions $K = 1, 2, 3, 4$. The median $ErrCV_5$ is represented by the solid line; the 5% and 95% percentiles are plotted as dashed lines.
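The "kink" rule described above amounts to picking the $K$ after which the CV score drops the most; a short sketch with hypothetical CV values:

```python
import numpy as np

# Hypothetical 5-fold CV curve-reconstruction errors for K = 1, 2, 3, 4:
# a sharp drop from K = 1 to K = 2, then a slow decline.
err_cv = np.array([8.1, 3.2, 3.0, 2.9])
drops = -np.diff(err_cv)            # decrease gained by adding one more fPC
K_hat = int(np.argmax(drops)) + 2   # +2 because drops[0] is the move from K=1 to K=2
print(K_hat)  # 2
```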
When fitting the anisotropic Matérn model, the set of neighborhood structures $\{\Delta_l\}_{l=1}^{m}$ needs to be determined. Consider the example of one-dimensional, equally-spaced locations where the $i$th location has coordinates $(i, 0)$. A larger spatial separation vector corresponds to fewer pairs of locations in $N(\Delta)$. Thus we often start with the most immediate neighborhoods. In simulation studies not reported here, we found that $ErrCV_J$ is not sensitive with respect to $m$, especially when spatial correlation is low. Instead of selecting an "optimal" $m$, we use a range of different $m$'s and take the trimmed average of the estimates derived across $m$. The maximum $m$ can be selected, based on prior knowledge, so that the spatial correlation beyond it is too low to have a meaningful impact. In the simulation studies of Section 6, all estimates are based on 20% trimmed averages across $m$.

Asymptotic Properties of Estimates
Assuming no spatial correlation, Yao et al. (2005) established the asymptotic consistency of their estimates. Paul and Peng (2010) showed in their work that if the correlation between sample curves is "weak" in an appropriate sense, then for their estimators the optimal rates of convergence in the correlated and i.i.d. cases are the same. Based on the framework of proof in Yao et al. (2005) and by introducing two sufficient conditions, we are able to extend the results in Yao et al. (2005) and show the consistency of $\hat\mu(t)$ and the cross-covariance function $\hat{G}_\Delta(s, t)$ in the following theorem.
The proof of Theorem 1* and more discussions can be found in Appendix A, B and C. In appendices, we first review the main theorems and their proofs discussed in Yao et al. (2005).
Then, we highlight necessary modifications to accommodate the spatial dependence.

Separability and Isotropy Tests
Separability of the spatial covariance is a convenient assumption under which $Cov(\xi_i, \xi_j)$ is simply a rescaled diagonal matrix, and thus a parsimonious model can be fitted. However, we would like to design a hypothesis test to examine whether this assumption is valid and justified by the data.
Spatial correlation may depend not only on distance but also on direction. Whether the correlation is isotropic or not may be an interesting question for researchers and is informative for subsequent analysis. In this section, we propose two hypothesis tests based on the SPACE model to address these issues.

Separability Test
Recall that the correlation matrix of the $k$th fPC scores among curves is denoted by $\rho_k$. We model $\rho_k$ through the anisotropic Matérn model with parameters $\alpha_k$, $\delta_k$, $\zeta_k$ and $\nu_k$. Suppose we use $K$ eigenfunctions to approximate the underlying process. We then partition the set $\{1, 2, \cdots, K\}$ into $A$ mutually exclusive and collectively exhaustive groups $\mathcal{K} = \{K_1, \cdots, K_A\}$. Denote the number of $k$'s in group $K_i$ by $|K_i|$. A generic hypothesis can be formulated in which parameters across $k$'s are assumed to be the same within each group but can be different between groups. Consider a simple example with $K = 3$, where we believe the correlation structures are the same among the first three fPC scores, versus the hypothesis that the correlations are all different.
Then the partition associated with the null hypothesis is $\mathcal{K}_0 = \{\{1, 2, 3\}\}$, and the partition for the alternative hypothesis is $\mathcal{K}_1 = \{\{1\}, \{2\}, \{3\}\}$. Both the null and the alternative can take various forms of partitions.
However, for illustration purposes, we consider the following test where no constraints are imposed under the alternative:
$$H_0: \alpha_{i_r} = \alpha_{j_r},\ \delta_{i_r} = \delta_{j_r},\ \zeta_{i_r} = \zeta_{j_r},\ \nu_{i_r} = \nu_{j_r}, \quad \forall i_r, j_r \in K_r,\ \forall r \text{ s.t. } |K_r| > 1. \quad (5.1)$$
The key step of the test is to construct hypothesized null curves from the observed data.
step 1. Estimate model parameters without constraints. Denote the estimates by $\hat\sigma^2$, $\hat\phi(t)$, $\hat\lambda$ and $\hat\theta_k$. Then compute the observed test statistic $S$ using these estimates.
step 2. Estimate the fPC scores $\hat\xi$ as in (3.11). Note $\hat\xi$ in (3.11) is arranged by curve index $i$. We rearrange it by fPC index $k$ and let $\hat\xi_k$ be the vector of estimated $k$th fPC scores across all locations.
step 3. Estimate $\theta_k$ under $H_0$. Specifically, for each $r$, fit the anisotropic Matérn model with the empirical correlations pooled across all $k \in K_r$, and denote the resulting estimates by $\hat\theta^0_k$.
step 4. For each $k$ whose associated $K_r$ has more than one index, let $\hat\rho_k$ be the spatial correlation matrix constructed using $\hat\theta^0_k$. Suppose $\hat\rho_k$ has the eigen-decomposition $VCV^T$.
Then $Cov(\hat\xi_k) = \hat\lambda_k VCV^T$. Define the transformation matrix $T = (\hat\lambda_k C)^{-1/2}V^T$ and the transformed scores $\hat{z}_k = T\hat\xi_k$. Then $Cov(\hat{z}_k) = I_{N \times N}$.
step 5. Resample the $\hat{z}_k$'s through curve-wise bootstrapping. Specifically, randomly sample the curve indices $\{1, \cdots, N\}$ with replacement and let $P(i)$ be the permuted index for curve $i$. Then the $b$th bootstrapped score for curve $i$ is obtained as $\hat{z}^b_{ik} = \hat{z}_{P(i)k}$ for all $k$'s such that $|K_r| > 1$.
step 6. Transform the resampled scores back to the original scale, $\hat\xi^b_k = T^{-1}\hat{z}^b_k$, for all $k$'s such that $|K_r| > 1$; for the remaining $k$'s set $\hat\xi^b_k = \hat\xi_k$.
step 7. Generate the $b$th set of Gaussian random noises $\{\epsilon^b(t_{ij})\}_{i=1,j=1}^{N, n_i}$ based on the noise variance $\hat\sigma^2$ estimated in step 1.
step 8. Construct the resampled observations as
$$Y^b_i(t_{ij}) = \hat\mu(t_{ij}) + \sum_{k=1}^{K}\hat\xi^b_{ik}\hat\phi_k(t_{ij}) + \epsilon^b(t_{ij}).$$
step 9. Re-estimate the model parameters from the resampled observations and compute the test statistic $S^b$.
step 10. Repeat steps 5 to 9 $B$ times and obtain $\{S^b\}_{b=1}^{B}$, which forms the empirical null distribution of the test statistic. Then make the decision by comparing the observed test statistic $S$ against this null distribution.
Note that an alternative and theoretically more precise way of doing step 4 is to use covariances constructed from $\hat\theta_k$ as opposed to $\hat\theta^0_k$. This alternative should remove spatial correlation more thoroughly, but simulation results not reported here suggest that it introduces more volatility into the z-scores, which offsets the benefit of lower residual spatial correlation. For testing the $H_0$ above, we transform $\hat\theta_k$ and $\hat\theta^b_k$ from the parameter space to the correlation space. Let $\bar\rho^*_r(\Delta_{eval}) = \sum_{k \in K_r}\rho^*(\Delta_{eval}; \hat\theta_k)/|K_r|$ be the average correlation computed at separation vector $\Delta_{eval}$ across the $k$'s in $K_r$. Then define the test statistic as
$$S = \sum_{r=1}^{A}\sum_{k \in K_r}\left(\rho^*(\Delta_{eval}; \hat\theta_k) - \bar\rho^*_r(\Delta_{eval})\right)^2.$$
We tried other statistics and found that statistics in the correlation space generally work better than those in the parameter space. In our implementation, we choose $\Delta_{eval}$ to be the most immediate neighborhood in Euclidean distance.
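Steps 4-6 of the resampling scheme (whitening the scores under the null, bootstrapping curve-wise, and mapping back) can be sketched as follows, with stand-in scores and a hypothetical null correlation model:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 50
D = np.abs(np.subtract.outer(np.arange(N), np.arange(N))).astype(float)
rho_k = np.exp(-D / 2.0)            # hypothesized null spatial correlation of k-th scores
lam_k = 0.8                         # k-th eigenvalue estimate
xi_k = np.linalg.cholesky(lam_k * rho_k) @ rng.standard_normal(N)  # stand-in scores

# Step 4: whiten the scores.  With rho_k = V C V^T, the transform
# T = (lam_k C)^{-1/2} V^T gives Cov(T xi_k) = I.
C, V = np.linalg.eigh(rho_k)        # eigenvalues C, eigenvectors V (columns)
T = np.diag((lam_k * C) ** -0.5) @ V.T
z_k = T @ xi_k

# Step 5: curve-wise bootstrap of the whitened scores, then map back (step 6).
P = rng.integers(0, N, size=N)      # resample curve indices with replacement
z_b = z_k[P]
xi_b = np.linalg.solve(T, z_b)      # T^{-1} z_b: resampled scores with null correlation
```

The whitening makes the bootstrap resampling valid by removing the spatial correlation implied by the null model; solving with $T$ restores it in the resampled scores.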

Isotropy Test
The isotropy test is essentially the same as the separability test. To make the test more general, we assume a priori that correlation structures are the same across the $k$'s within $K_r$ for $r = 1, 2, \cdots, A$. Note that within each $K_r$, the $\alpha_k$'s and $\delta_k$'s must take equal values across $k$'s. Let $\mathcal{K}_a = \{K_{r_1}, \cdots, K_{r_a}\}$ be the set of groups whose $\alpha_k$'s are all assumed to be zero, and let $\mathcal{K}^c_a$ be the complement of $\mathcal{K}_a$, where $1 \leq a \leq A$. To test the hypothesis
$$H_0: \exists\,\mathcal{K}_a = \{K_{r_1}, \cdots, K_{r_a}\} \text{ s.t. } \alpha_k = 0,\ \forall k \in K_{r_i},\ i = 1, \cdots, a, \quad (5.6)$$
we propose a similar procedure which differs slightly from the separability test in steps 1, 3, 7 and 8. Specifically, we replace those steps of the separability test with the following.
step 1*. Estimate model parameters under the prior assumption. In particular, $\hat\theta_k$ is obtained by pooling the empirical correlations from all $k$'s within $K_r$. Then compute $S$ based on these estimates.
step 3*. Under the null hypothesis, we estimate the anisotropic Matérn parameters. For any $k \in K_r$ with $K_r \in \mathcal{K}_a$, $\hat\alpha^0_k$ and $\hat\delta^0_k$ are fixed at zero and one respectively, whereas $\hat\zeta^0_k$ and $\hat\nu^0_k$ are estimated from $\{\Delta_l, \hat\rho^*_k(\Delta_l), \forall k \in K_r\}_{l=1}^{L}$. Then $\hat\theta^0_k = (0, 1, \hat\zeta^0_k, \hat\nu^0_k)$ for any $k \in K_r$ with $K_r \in \mathcal{K}_a$. For any $k \in K_r$ with $K_r \in \mathcal{K}^c_a$, no extra estimation is needed; we keep the fPC scores estimated in step 2 for those $k$'s and perform no transformation on them.
step 8*. Construct the resampled curves analogously to step 8, using the scores transformed under the isotropic null.

Model Estimation
We examine the estimation of the SPACE model in both the one-dimensional and two-dimensional cases. In the one-dimensional case, we are interested in the estimation of $\zeta$, whereas the estimation of $\alpha$ is our main focus in the two-dimensional case. Results are summarized in Tables 1 and 2.
Note that the first order derivative of the Matérn correlation with respect to $\zeta$ flattens as the correlation approaches 1. Thus extremely large estimates of $\zeta$ are occasionally expected, which leads to the positive skewness observed in Table 1. We also look at the estimation performance in the correlation space. Specifically, we examine the estimated correlation at $\Delta_{eval} = (0, 1)$.
Estimation is more difficult in the two-dimensional case where two more parameters, $\alpha$ and $\delta$, need to be estimated. In general, estimates based on more significant fPCs have better quality in terms of root-mean-squared error (RMSE). To assess curve reconstruction performance, consider a grid of time points for function evaluation, $t^{eval} = (t^{eval}_1, t^{eval}_2, \cdots, t^{eval}_n)$. Let
$$Err = \sum_{i=1}^{N}\sum_{l=1}^{n}\left(X_i(t^{eval}_l) - \hat{X}_i(t^{eval}_l)\right)^2$$
be the curve reconstruction error. Then define the reconstruction improvement (IP) as $IP = \log(Err_{PACE}/Err_{SPACE})$. Out of 100 simulated data sets, we calculate the percentage with IP greater than 0. The improvement of SPACE over PACE is more prominent in scenarios with larger noise and higher spatial correlation. It is easy to show that when $\sigma = 0$ and the number of eigenfunctions is greater than the maximum number of observations per curve, SPACE produces exactly the same reconstructed curves as PACE. If the noise is large, the information provided by each curve itself is more contaminated, and neighboring locations provide more useful guidance in curve reconstruction. As the spatial correlation increases, observations at neighboring locations become more informative.
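The error and improvement metrics can be sketched directly:

```python
import numpy as np

def reconstruction_error(X_true, X_hat):
    """Sum of squared reconstruction errors over curves and evaluation times."""
    return float(np.sum((X_true - X_hat) ** 2))

def improvement(err_pace, err_space):
    """IP = log(Err_PACE / Err_SPACE); positive values favor SPACE."""
    return float(np.log(err_pace / err_space))

assert improvement(2.0, 1.0) > 0          # SPACE halves the error: IP > 0
assert improvement(1.0, 1.0) == 0.0       # equal errors: IP = 0
```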

Hypothesis Test
We evaluate the hypothesis tests proposed in Section 5. The separability and isotropy tests are implemented on simulated data sets with one-dimensional and two-dimensional locations respectively. In each test, 100 curves are created in each data set and 25 data sets are generated. We first focus on the separability test by examining different statistics: the absolute difference between $\zeta_1$ and $\zeta_2$, the difference of spatial correlations at $\Delta_{eval} = (0, 1)$, and the absolute difference of spatial correlations at $\Delta_{eval} = (0, 1)$. Next, we test the isotropy effect assuming separability, and the test statistic is simply $\hat\alpha$. Each alternative test is evaluated at a single set of parameters. All settings are described in Tables 3 and 4. With the nominal size set to 5%, the two-sided empirical sizes and powers are summarized in Tables 3 and 4. Both tests deliver reasonable sizes and powers.

Harvard Forest Data
The SPACE model is motivated by the spatial correlation observed in the Harvard Forest vegetation index data described in Section 2.2 and observed in Liu et al. (2012). In this section, we first focus on verifying the directional effect observed in the second fPC scores through the proposed isotropy test. The Harvard Forest EVI data are observed over a dense grid of regularly spaced time points. EVI observed at each individual location is smoothed using a regularization approach based on the expansion of a saturated Fourier basis. Specifically, the sum of the squared error loss and a penalty on total curvature is minimized. Let $x_i(t)$ be the smoothed EVI at the $i$th location and $\theta_k(t)$ be the $k$th basis function. Define the $N \times 1$ vector-valued function $x(t) = (x_1(t), \cdots, x_N(t))^T$ and the $K \times 1$ vector-valued function $\theta(t) = (\theta_1(t), \cdots, \theta_K(t))^T$. All $N$ curves can be expressed as $x(t) = C\theta(t)$, where $C$ is the coefficient matrix of size $N \times K$. The functional PCA problem then reduces to the multivariate PCA of the coefficient matrix $C$. Assuming $K < N$, let $U$ be the $K \times K$ matrix of eigenvectors of $C^T C$, and let $\phi(t) = (\phi_1(t), \cdots, \phi_K(t))^T$ be the vector-valued eigenfunctions. The smoothing method is described in Liu et al. (2012), and a more general introduction can be found in Ramsay and Silverman (2005). It has been shown that smoothing of individual curves and smoothing of the covariance surface are asymptotically equivalent. When applying the isotropy test proposed in Section 5.2 to the Harvard Forest EVI, we swap the estimation of fPC scores and anisotropy Matérn parameters. In particular, we estimate fPCs and associated scores of hypothesized null curves through the process described above. Next, we estimate anisotropy Matérn parameters based on the empirical correlation calculated from the estimated fPC scores.
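The reduction of functional PCA to multivariate PCA of the coefficient matrix $C$ can be sketched as below. The basis, sample size, and coefficients are all synthetic, and an orthonormal basis is assumed so that the eigenproblem of the covariance operator reduces to the eigen-decomposition of $C^T C$.

```python
import numpy as np

# Sketch: fPCA via PCA of basis coefficients, assuming an orthonormal basis.
rng = np.random.default_rng(1)
N, K = 100, 5                      # curves, basis functions (synthetic sizes)
C = rng.normal(size=(N, K))        # coefficient matrix: x(t) = C theta(t)

# Eigen-decomposition of C^T C / N gives fPC weights in the basis.
evals, U = np.linalg.eigh(C.T @ C / N)
order = np.argsort(evals)[::-1]    # sort eigenvalues in decreasing order
evals, U = evals[order], U[:, order]

def theta(t):
    """Hypothetical orthonormal Fourier-type basis evaluated at points t."""
    t = np.asarray(t)
    cols = [np.ones_like(t)]
    for m in range(1, (K + 1) // 2 + 1):
        cols.append(np.sqrt(2) * np.sin(2 * np.pi * m * t))
        cols.append(np.sqrt(2) * np.cos(2 * np.pi * m * t))
    return np.column_stack(cols)[:, :K]    # shape (len(t), K)

def phi(t):
    """Eigenfunctions expressed in the basis: phi(t) = U^T theta(t)."""
    return theta(t) @ U                    # shape (len(t), K)

# fPC scores are the coordinates of C in the eigenvector basis.
scores = C @ U
print(evals[0] >= evals[1])
```

The leading columns of `scores` play the role of the fPC scores whose spatial pattern is examined by the isotropy test.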
Following the proposed procedure, we only resample the second fPC scores, as they present a potential anisotropy effect. For simplicity, we construct noiseless hypothesized curves so that re-smoothing is not needed. The null distribution of the estimated anisotropy angle $\hat{\alpha}$ is shown in Figure 3. The result suggests the rejection of the isotropy effect and confirms the diagonal pattern in the fPC scores. Next, we apply the SPACE framework to the gap-filling of the Harvard Forest EVI data. To that end, we create sparse samples by randomly selecting 5 observations from each location and each year; 100 sparse samples are created. To make the estimation and reconstruction of EVI across 625 pixels more computationally tractable, both SPACE and PACE are performed separately for each year. The distribution of IP in year 2005 is summarized in Figure 3. All of the 100 samples show improved reconstruction performance using SPACE, and similar results are achieved for the other years. By incorporating the spatial correlation estimated from the EVI data, better gap-filling can be achieved.

Conclusion
Much of the literature in functional data analysis assumes no spatial correlation, or ignores spatial correlation when it is mild. We propose the spatial principal analysis based on conditional expectation (SPACE) to estimate the spatial correlation of functional data, using non-parametric smoothers on curves and surfaces. We show that leave-one-bin-out cross-validation based on binned data performs well in selecting bandwidths for local linear smoothers. Empirical spatial correlation is calculated as the ratio of the eigenvalues of the cross-covariance and covariance surfaces.
A parametric model, the Matérn correlation augmented with anisotropy parameters, is then fitted to the empirical spatial correlations at a sequence of spatial separation vectors. In finite samples, estimates are better for separable covariance structures than for non-separable ones.
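A minimal sketch of an anisotropic Matérn correlation is given below. The geometric-anisotropy construction (rotate the separation vector by an angle, rescale one axis, then apply the isotropic Matérn) is one common parameterization and an assumption here, not necessarily the paper's exact model; all parameter names are illustrative.

```python
import numpy as np
from scipy.special import gamma, kv

def matern(d, rho=1.0, nu=0.5):
    """Isotropic Matérn correlation at distances d (range rho, smoothness nu)."""
    d = np.asarray(d, dtype=float)
    out = np.ones_like(d)            # correlation is 1 at zero distance
    pos = d > 0
    u = np.sqrt(2 * nu) * d[pos] / rho
    out[pos] = (2 ** (1 - nu) / gamma(nu)) * u ** nu * kv(nu, u)
    return out

def anisotropic_matern(delta, alpha=0.0, delta_ratio=1.0, rho=1.0, nu=0.5):
    """Geometric anisotropy (an assumed construction): rotate the separation
    vector by angle alpha, shrink one axis by delta_ratio, then apply the
    isotropic Matérn to the transformed distance."""
    c, s = np.cos(alpha), np.sin(alpha)
    R = np.array([[c, s], [-s, c]])
    A = np.diag([1.0, 1.0 / delta_ratio]) @ R
    d = np.linalg.norm(A @ np.asarray(delta, dtype=float))
    return matern(np.array([d]), rho, nu)[0]

# Correlation equals 1 at zero separation and decays with distance.
print(anisotropic_matern([0.0, 0.0]),
      anisotropic_matern([1.0, 0.0]),
      anisotropic_matern([2.0, 0.0]))
```

With $\nu = 0.5$ the Matérn reduces to the exponential correlation $e^{-d/\rho}$, which makes the sketch easy to verify.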
The fitted anisotropy Matérn parameters can be used to compute the spatial correlation at any given spatial separation vector and thus are used to reconstruct trajectories of sparsely observed curves.
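The reconstruction step rests on the conditional expectation (BLUP) of the fPC scores given the sparse, noisy observations. The sketch below shows the Gaussian BLUP formula for a single curve with diagonal score covariance; in the spatially correlated setting, curves would be stacked and `Lambda` replaced by a block covariance built from the fitted spatial correlations. All names and the toy data are hypothetical.

```python
import numpy as np

def blup_scores(y, mu, Phi, Lambda, sigma2):
    """BLUP of fPC scores given noisy observations y at the rows of Phi:
        xi_check = Lambda Phi^T (Phi Lambda Phi^T + sigma2 I)^{-1} (y - mu).
    Phi: (n_obs, K) eigenfunctions at observation times; Lambda: (K, K)
    score covariance (spatial correlation would enter via a block
    covariance over stacked curves instead of this diagonal Lambda)."""
    Sigma_y = Phi @ Lambda @ Phi.T + sigma2 * np.eye(len(y))
    return Lambda @ Phi.T @ np.linalg.solve(Sigma_y, y - mu)

# Toy check: with dense, essentially noiseless data, BLUP recovers the scores.
t = np.linspace(0, 1, 200)
Phi = np.column_stack([np.sqrt(2) * np.sin(2 * np.pi * t),
                       np.sqrt(2) * np.cos(2 * np.pi * t)])
Lambda = np.diag([2.0, 1.0])
xi_true = np.array([1.5, -0.5])
y = Phi @ xi_true                      # noiseless synthetic curve
xi_check = blup_scores(y, 0.0, Phi, Lambda, sigma2=1e-6)
print(np.round(xi_check, 3))
```

With sparse observations the same formula still applies; the estimate then shrinks toward the prior mean according to how informative the observed points are.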
This work compares with Yao et al. (2005), where curves are assumed to be independent. We show that by incorporating the spatial correlation, reconstruction performance is improved; the higher the noise and the true spatial correlation, the greater the improvement. We demonstrate the flexibility of the SPACE model in modeling the separability and anisotropy of the covariance structure. Moreover, two tests are proposed to explicitly answer whether the covariance is separable and/or isotropic. Reasonable empirical sizes and powers are obtained in each test.
We then apply the SPACE method to the Harvard Forest EVI data. In particular, we confirm the diagonal pattern observed in the second fPC scores through a slightly modified version of the proposed isotropy test. Moreover, we demonstrate that by explicitly taking the spatial correlation into account, SPACE produces more accurate gap-filled EVI trajectories on average.
Finally, we present a series of asymptotic results which demonstrate the consistency of the model estimates. In particular, the same convergence rate is derived for the correlated case as for the i.i.d. case, assuming a mild spatial correlation structure.

A More on Consistency
This section discusses modifications and improvements of results in Yao et al. (2005) in order to incorporate spatial correlation. Theorem 1* of Section 4 extends the corresponding result of Yao et al. (2005) and serves as the foundation of our work. In subsequent discussions, we change some notation in Yao et al. (2005) to accommodate existing notation in our work. To avoid confusion, we make the following clarifications. We denote the number of curves by capital $N$, whereas Yao et al.
used $n$. For the number of observations on curve $i$, we use $n_i$, whereas Yao et al. used $N_i$. Yao et al. assumed $N_i$ follows the distribution of a random variable $N$, which is denoted by $N^*$ in our work. We denote all variables that depend on the BLUP of fPC scores by a check sign instead of the tilde used by Yao et al. We use the same notation as Yao et al. (2005) for other variables unless stated otherwise. Yao et al. (2005) state five theorems, together with a series of lemmas and corollaries, to describe the convergence of PACE estimators. Theorem 1 establishes the uniform convergence rates in probability of the local linear estimates of the mean $\mu(t)$ and the covariance function $G(s, t)$. Corollary 1 states the convergence rate in probability of the measurement error variance $\sigma^2$ and is derived by applying the results in Theorem 1. Theorem 2 establishes the convergence in probability of the eigen-structure of the covariance kernel, including the convergence of eigenvalues $\lambda_k$ of multiplicity 1, and the $L^2$ and uniform convergence of the eigenfunctions $\phi_k(t)$ associated with $\lambda_k$ of multiplicity 1. Based on the results in Theorems 1 and 2, Theorem 3 derives the convergence of the estimated fPC score $\hat{\xi}_{ik}$ towards the conditional expectation of the true fPC score (the BLUP) $\check{\xi}_{ik}$, and the pointwise convergence of the reconstructed curve $\hat{X}_{K,i}(t)$, based on the $K$-dimensional approximation, towards its associated infinite-dimensional limit. Under a Gaussian assumption, Theorem 4 establishes the pointwise asymptotic distribution of the standardized deviance between the reconstructed curve $\hat{X}_{K,i}(t)$ and the true curve $X_i(t)$. In Theorem 5, a simultaneous confidence region for $X_{K,i}(t)$ over $\mathcal{T}$ in terms of the reconstructed curve $\hat{X}_{K,i}(t)$ is established. Finally, the simultaneous region of the $K$-dimensional vector $\xi_{K,i}$ in terms of the estimated fPC score vector $\hat{\xi}_{K,i}$ is derived in Corollary 2. As the cornerstone of all other derived results, Theorem 1 is derived based on the results in Lemmas 1 and 2.
These two lemmas establish the uniform convergence rates for the weighted average $\Psi_{pN}(t)$ of the function $\psi_p(T_{ij}, Y_{ij})$ and the kernel function $\kappa_1((t - T_{ij})/h_\mu)$ for the one-dimensional smoother, and for the weighted average $\Theta_{pN}(t, s)$ for the two-dimensional smoother. These weighted averages are the building blocks of the closed-form expressions of the local linear smoothers. For example, the local constant estimator of $\mu(t)$ can be expressed as $$\hat{\mu}(t) = \frac{\sum_{i=1}^{N} \sum_{j=1}^{n_i} \kappa_1((t - T_{ij})/h_\mu)\, Y_{ij}}{\sum_{i=1}^{N} \sum_{j=1}^{n_i} \kappa_1((t - T_{ij})/h_\mu)}.$$ The above results are valid only under a series of conditions. Yao et al. (2005) describe three groups of conditions named after the letters A, B and C. Some more general assumptions are also made. Specifically, both the time point $T_{ij}$ and the observed value $Y_{ij}$ of curve $i$ made at this time are random variables. $T_{ij}$ is assumed to follow the same marginal distribution $T$ with density $f(t)$, and the data pair $(T_{ij}, Y_{ij})$ is assumed to follow the same joint distribution $(T, Y)$ with density $g(t, y)$, for any given curve $i$ and time index $j$. The random vector $(T_{ij}, T_{il}, Y_{ij}, Y_{il})$ for $j \neq l$ and any $i$ is assumed to follow the same distribution $(T_1, T_2, Y_1, Y_2)$ with density $g(t_1, t_2, y_1, y_2)$. In addition, $T_{ij}$ is assumed to be independent across all $i$ and $j$, and both $(T_{ij}, Y_{ij})$ and $(T_{ij}, T_{il}, Y_{ij}, Y_{il})$ are independent across the curve index $i$.
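The local constant (Nadaraya-Watson) estimator above can be sketched directly as a kernel-weighted average over the pooled observations; the Gaussian kernel and all simulation settings here are illustrative choices, not the paper's.

```python
import numpy as np

def local_constant_mean(t_grid, T, Y, h, kernel=lambda u: np.exp(-u**2 / 2)):
    """Nadaraya-Watson (local constant) estimate of the mean function:
    at each grid point t, a kernel-weighted average of all pooled
    observations (T_ij, Y_ij) with bandwidth h."""
    T, Y = np.ravel(T), np.ravel(Y)
    est = np.empty(len(t_grid))
    for m, t in enumerate(t_grid):
        w = kernel((t - T) / h)
        est[m] = np.sum(w * Y) / np.sum(w)
    return est

# Synthetic pooled data: noisy observations of a sine mean function.
rng = np.random.default_rng(3)
T = rng.uniform(0, 1, size=2000)                          # pooled times T_ij
Y = np.sin(2 * np.pi * T) + rng.normal(0, 0.2, T.shape)   # noisy values Y_ij
t_grid = np.linspace(0.05, 0.95, 19)
mu_hat = local_constant_mean(t_grid, T, Y, h=0.05)
print(np.max(np.abs(mu_hat - np.sin(2 * np.pi * t_grid))) < 0.2)
```

The local linear smoother used in the paper adds a first-order term to reduce boundary bias, but the weighted-average building blocks $\Psi_{pN}(t)$ are of the same form.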
Condition series (A1) pertains to the number of observations per curve, which is assumed to be a discrete random variable following the distribution $N^*$; $N^*$ is assumed to have finite mean, to take values greater than 1 with positive probability, and to be independent of all $T_{ij}$'s and $Y_{ij}$'s. Another series assumes that the data asymptotically follow a linear scheme. Condition (A7) assumes that the pointwise limit of $\phi_{K,s}^T \Sigma(\check{\xi}_{K,i} - \xi_{K,i}) \phi_{K,t} = \phi_{K,s}^T \Omega_K \phi_{K,t}$ as $K \to \infty$ exists, such that Theorems 4 and 5 hold.
Category B specifies properties of the density and smoothing kernel functions needed for Lemmas 1 and 2 to hold.
Category C assumes favorable continuity and boundedness properties of the functions $\psi_p$ and $\theta_p$, which are crucial for Lemmas 1 and 2.
To reduce notational confusion between $v$ and $\nu$ in Yao et al. (2005), we denote the order of the kernel functions $\kappa_1(t)$ and $\kappa_2(t, s)$ by $(a, b)$ instead of the $(\nu, l)$ used in Yao et al. (2005). We review the two lemmas and five theorems in more detail below. The following review provides a skeleton of the proofs in Yao et al. (2005), which serves as a scaffold to facilitate the introduction of our results. Please refer to Yao et al. (2005) for more details regarding the proofs for the i.i.d. case.
Proofs in Yao et al. (2005) refer to Slutsky's theorem on many occasions. However, we find this reference imprecise, because the widely known statement of Slutsky's theorem does not involve convergence rates and big-$O$ notation as Yao et al. (2005) have assumed and suggested.
Since it is heavily used throughout the proofs, we state this auxiliary result in a separate lemma, Lemma 4, indexed after the existing lemmas in Yao et al. (2005).
Here $c_n \to 0$ as $n \to \infty$, and $g(u) \in C^2$ on $\mathbb{R}^d$ for any $d > 0$. The proof of this lemma is deferred to Appendix B. As a special case, a stronger rate can be derived when $a_n/b_n \to 0$ as $n \to \infty$. Recall that in Yao et al. (2005), $(T_{ij}, T_{il}, Y_{ij}, Y_{il})$ is assumed to follow the distribution of the random vector $(T_1, T_2, Y_1, Y_2)$ with density function $g(t_1, t_2, y_1, y_2)$. Accordingly, for each spatial separation vector $\Delta$ and any location pair $(i, j) \in N(\Delta)$, we assume $(T_{ik}, T_{jl}, Y_{ik}, Y_{jl})$ follows the distribution of $(T_1, T_2, Y_1, Y_2)$ with density function $g_\Delta(t_1, t_2, y_1, y_2)$. We assume the same regularity conditions on $g_\Delta(t_1, t_2, y_1, y_2)$ as those in Yao et al. (2005). As mentioned earlier, Lemmas 1 and 2 in Yao et al. (2005) are the foundations of the subsequent results and rely on the assumption of zero spatial correlation. Modifications due to the incorporation of spatial dependence start from these two lemmas.
Lemma 1*, the counterpart of Lemma 1 in Yao et al. (2005), states the uniform convergence of the weighted averages $\Psi_{pN}(t)$ in the one-dimensional case. As pointed out above, the key step in achieving the stated rate is to show that $Var(\varphi_{pN}(u)) = O(1/N)$. In the presence of spatial correlation, we have $$Var(\varphi_{pN}(u)) = \frac{1}{N^2}\sum_{i=1}^{N} Var(Z_i) + \frac{1}{N^2}\sum_{1 \le i \neq j \le N} Cov(Z_i, Z_j),$$ where $\sum_{1 \le i \neq j \le N} Cov(Z_i, Z_j)/N^2$ is the extra term introduced by the spatial correlations. Requiring $\sum_{1 \le i \neq j \le N} Cov(Z_i, Z_j)/N^2 = O(1/N)$ is then a straightforward way for $Var(\varphi_{pN}(u))$ to remain $O(1/N)$. Since there are $N(N-1)$ terms in the extra summation, we expect the correlation between $Z_i$ and $Z_j$ to decay fast enough as locations $i$ and $j$ move further apart. In sum, to achieve the desired convergence rate, the off-diagonal entries of the spatial correlation matrices need to decrease at a proper pace away from the diagonal. Within the current context, we introduce condition (D1) as follows. Lemma 2* is the two-dimensional counterpart of Lemma 1*. Accordingly, we change the definition of the weighted average $\Theta_{pN}(s, t)$ so that the outermost layer of summation is taken over all location pairs in $N(\Delta)$. We thus replace every $N$ that accompanies $h_G$ with $|N(\Delta)|$ in the lemmas, theorems, corollaries and conditions; in particular, the affected conditions include (A2.2) and (C2.1b) in Yao et al. (2005). To achieve the same rate as in Lemma 2 of Yao et al. (2005), we derive the following upper bound for the variance of $\gamma_{pN}(u, v)$.
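The requirement behind condition (D1) can be illustrated numerically: under an assumed toy structure where $Cov(Z_i, Z_j)$ decays geometrically in $|i - j|$, the extra term $\sum_{i \neq j} Cov(Z_i, Z_j)/N^2$ is $O(1/N)$, so multiplying it by $N$ stays bounded as $N$ grows.

```python
import numpy as np

def extra_term(N, rho=0.5):
    """Extra variance term sum_{i != j} Cov(Z_i, Z_j) / N^2 under a toy
    geometric decay Cov(Z_i, Z_j) = rho^{|i - j|} (assumed structure)."""
    idx = np.arange(N)
    C = rho ** np.abs(idx[:, None] - idx[None, :])   # correlation matrix
    np.fill_diagonal(C, 0.0)                         # keep off-diagonal part
    return C.sum() / N ** 2

# If the extra term is O(1/N), then N * extra_term(N) stays bounded
# (here it approaches 2 rho / (1 - rho) = 2 as N grows).
ratios = [N * extra_term(N) for N in (100, 200, 400, 800)]
print(np.round(ratios, 3))
```

A correlation that decays too slowly (e.g. polynomially with a small exponent) would make this ratio diverge, which is exactly what condition (D1) rules out.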
Theorem 3* also includes the new conditions (D1) and (D2) in addition to the existing ones in Theorem 3 of Yao et al. (2005). The major difference lies in the proof. Specifically, we estimate all fPC scores $\xi$ in one conditional expectation, as in equation (3.11). For the $i$th curve and any integer $K \ge 1$, let $H_K = \Sigma(\check{\xi} - \xi) = \Sigma(\xi) - \Sigma(\check{\xi})$, and let $H_{K,ii}$ be the diagonal block corresponding to the $i$th curve.
Let $w_K(s, t) = \phi_{K,s}^T H_{K,ii} \phi_{K,t}$ for $s, t \in \mathcal{T}$, and $\hat{w}_K(s, t) = \hat{\phi}_{K,s}^T \hat{H}_{K,ii} \hat{\phi}_{K,t}$. Note that a diagonal block of a positive definite matrix is also positive definite, so $\{w_K(s, t)\}$ is a sequence of continuous positive definite functions. These modifications do not change the statement of condition (A7) in Yao et al. (2005).
The statements of Theorems 4* and 5* are the same as those of Theorems 4 and 5 in Yao et al. (2005), except for the addition of conditions (D1) and (D2).
Finally, we restate Corollary 2* as follows: under the assumptions of Theorem 5*, the region $\{\xi : (\check{\xi} - \xi)^T H_K^{-1} (\check{\xi} - \xi) \le \chi^2_{d, 1-\alpha}\}$ is an asymptotic $(1-\alpha)$ simultaneous confidence region, where $\chi^2_{d, 1-\alpha}$ is the $(1-\alpha)$th percentile of the chi-square distribution with $d$ degrees of freedom. In this corollary, the simultaneous confidence region of all fPC scores is considered, rather than that of the fPC scores of each individual curve as in Yao et al. (2005). Recall that $\xi$ is the vector of fPC scores from all curves stacked together and $H_K$ is the covariance of $\check{\xi} - \xi$. The arguments proving Corollary 2* remain the same if we replace $\xi_i$, $\tilde{\xi}_i$, $\hat{\xi}_i$ and $\Omega_K$ in Yao et al. (2005) with $\xi$, $\check{\xi}$, $\hat{\xi}$ and $H_K$, respectively, observing that the validity of Corollary 2 in Yao et al. (2005) depends only on the positive-definiteness of $\Omega_K$ and the joint Gaussian assumption on $\xi_i$, both of which apply to their counterparts in the correlated case.
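The chi-square construction can be checked by Monte Carlo: if $\check{\xi} - \xi \sim N(0, H_K)$, the quadratic form $(\check{\xi} - \xi)^T H_K^{-1} (\check{\xi} - \xi)$ is chi-square with $d$ degrees of freedom, so the region attains its nominal coverage. The matrix `H` below is an arbitrary positive definite matrix chosen purely for illustration.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(4)
d = 3
A = rng.normal(size=(d, d))
H = A @ A.T + d * np.eye(d)            # illustrative positive definite "H_K"

# Draw Gaussian errors xi_check - xi and form the quadratic statistics.
err = rng.multivariate_normal(np.zeros(d), H, size=20000)
q = np.einsum('ij,jk,ik->i', err, np.linalg.inv(H), err)

# Fraction of draws falling inside the chi-square region.
coverage = np.mean(q <= chi2.ppf(0.95, df=d))
print(round(coverage, 3))  # close to the nominal 0.95
```

In practice $H_K$ is replaced by its estimate, so the empirical coverage is only asymptotically nominal.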

B Proof of Lemma 4
Proof. According to the assumption, for any $\epsilon > 0$, there exists $M$ such that
Let $H_{K,ii}$ be the diagonal sub-block of $H_K$ corresponding to curve $i$. Theorem 4*, the counterpart of Theorem 4 in Yao et al. (2005), establishes that the distribution of $\hat{X}_{K,i}(t) - X_i(t)$ may be asymptotically approximated by $N(0, \hat{\phi}_{K,t}^T \hat{H}_{K,ii} \hat{\phi}_{K,t})$. Assuming the "weak" spatial correlation described in conditions (D1) and (D2), we construct asymptotic pointwise confidence intervals for $X_i(t)$.
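Given the asymptotic normal approximation above, the pointwise interval is $\hat{X}_{K,i}(t) \pm z_{1-\alpha/2}\sqrt{\hat{\phi}_{K,t}^T \hat{H}_{K,ii} \hat{\phi}_{K,t}}$. A minimal sketch, in which both the reconstructed curve and the variance function are hypothetical placeholders rather than outputs of the actual SPACE fit:

```python
import numpy as np
from scipy.stats import norm

def pointwise_ci(x_hat, var_t, alpha=0.05):
    """Asymptotic pointwise interval x_hat(t) +/- z_{1-alpha/2} sqrt(var(t)),
    where var(t) stands in for phi_{K,t}^T H_{K,ii} phi_{K,t} on a grid."""
    z = norm.ppf(1 - alpha / 2)
    half = z * np.sqrt(var_t)
    return x_hat - half, x_hat + half

t = np.linspace(0, 1, 50)
x_hat = np.sin(2 * np.pi * t)      # a placeholder reconstructed curve
var_t = 0.01 * (1 + t)             # a placeholder variance function
lo, hi = pointwise_ci(x_hat, var_t)
print(np.all(lo < x_hat), np.all(x_hat < hi))
```

Because the approximation is pointwise, these bands are narrower than the simultaneous bands of Theorem 5* and do not control coverage over the whole domain.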