A note on repeated measures analysis for functional data

In this paper, the repeated measures analysis for functional data is considered. The known testing procedures for this problem are based on test statistic being the integral of the difference between sample mean functions, which takes into account only “between group variability”. We modify this test statistic to use also information about “within group variability”. More precisely, we construct the new test statistics being integral and supremum of pointwise test statistic obtained by adapting the classical paired t-test statistic to functional data framework. The testing procedures are based on different methods of approximating the null distribution of the test statistics, namely the Box-type approximation, nonparametric and parametric bootstrap and permutation approaches. These approximations do not perform equally well under finite samples, which is established in simulation experiments, indicating the best new tests. The simulations and an application to mortality data suggest that some of the new procedures outperform the known tests in terms of size control and power.


Introduction
In the recent past two decades, the functional data analysis (FDA) dealing with samples of random curves or functions has increased its popularity. The first reason for this is the technological progress allowing to continuously monitor and record data for many processes in biostatistics, chemometrics, econometrics, geophysics, medicine, meteorology etc. Moreover, the increasing computer power allows to conduct an effective analysis of such (high-dimensional) data in a reasonable time. Secondly, the monographs by Ramsay and Silverman (1997, 2005 offering a broad range of the FDA methods and presenting their applications to real data problems have helped in B Łukasz Smaga ls@amu.edu.pl 1 Faculty of Mathematics and Computer Science, Adam Mickiewicz University, ul. Umultowska 87, 61-614 Poznań, Poland learning the basics of the FDA in an orderly and accessible way for a wide audience. The nonparametric perspective of the FDA is proposed by Ferraty and Vieu (2006). The books by Horváth and Kokoszka (2012) and Zhang (2013) complement the previous monographs on new theoretical and practical aspects of the FDA.
Testing statistical hypothesis about functions or curves is one of the most commonly investigated problems in the FDA. The testing procedures for functional analysis of variance problem, where one wants to check the equality of mean functions in a number of groups of independent functional observations, are very popular. In Cuevas et al. (2004), Górecki and Smaga (2015), Ramsay and Silverman (2005), Zhang (2013), Zhang et al. (2018) and Zhang and Liang (2014), the parametric functional ANOVA was studied by a broad range of different methods. On the other hand, Delicado (2007) considered nonparametric functional ANOVA, while Górecki and Smaga (2017) investigated the multivariate analysis of variance for functional data. Some other hypothesis testing problems are as follows: Gabrys and Kokoszka (2007) proposed a simple portmanteau test of independence for functional observations based on their principal component expansion. In functional regression analysis, the statistical significance of (fixed or functional) coefficients was investigated, for example, by Collazos et al. (2016), Kokoszka et al. (2008) and Maslova et al. (2010). Testing the linearity of a scalar-on-function regression relationship is considered by McLean et al. (2015). Kosiorowski et al. (2017) proposed a novel statistic for detecting a structural change in functional time series based on a local Wilcoxon statistic. The interval testing procedure for functional data in different frameworks (e.g., one or two-population frameworks) was proposed by Pini and Vantini (2017). Their procedure is based on means of different basis expansions. In the case of spatial functional data, Giraldo et al. (2018) proposed the test based on the Mantel statistic for checking spatial autocorrelation, which is an important step in the statistical analysis of spatial data.
In this paper, we consider the repeated measures analysis for functional data, where we have the repeated functional data for the same subjects probably submitted to different conditions. Martínez-Camblor and Corral (2011) proposed first testing procedure for this problem. Their tests are based on test statistic being the integral of the difference between sample mean functions and bootstrap and permutation approaches. Smaga (2017) proposed the test based on the same test statistic, but its distribution was approximated by the Box-type approximation, which resulted in less time-consuming procedure than resampling and permutation methods. Nevertheless, all these tests perform equally well in terms of size control and power under finite samples. Note that since the tests considered in these papers are constructed mainly on asymptotic distribution of the test statistic, it is based only on "between group variability". In this work, we show how to modify this test statistic to take into account also "within group variability". Namely, we adapt the usual test statistic of the classical paired two-sample test to functional data framework obtaining an appropriate pointwise test statistic. We construct two test statistics being integral and supremum of the pointwise test statistic. Their distributions are approximated by parametric methods based on derived asymptotic distributions as well as nonparametric bootstrap and permutation approaches. Simulation results indicate that the new testing procedures may not perform equally well under finite samples (in contrast to methods proposed by Corral 2011, andSmaga 2017), and some of them are not recommended when the sample size is small. However, the best new permutation tests usually work better than the tests by Martínez-Camblor and Corral (2011) and Smaga (2017), which is seen in simulations and real data application.
The rest of the paper is organized as follows: In Sect. 2, we first formulate the repeated measures analysis for functional data. Then, we propose new test statistics for this problem and consider different methods of approximating their null distribution to construct testing procedures. Section 3 is devoted to Monte Carlo simulation studies providing an idea of finite sample behavior of the new tests and comparing them with the known tests. In Sect. 4, a real data example based on the mortality data is presented. Section 5 concludes the paper. In Appendix, the technical assumptions and the proof of theoretical result are given.

Repeated measures analysis for functional data
In this section, we present new testing procedures for the repeated measures analysis for functional data. They can be seen as modifications of the tests proposed in Corral (2011) andSmaga (2017) by taking into account also "within group variability".
We formulate the repeated measures analysis for functional data using the notation of Martínez-Camblor and Corral (2011). In this setting, we are interested in checking the equality of two functions obtained from the same subject (probably submitted to different conditions). Assume that X 1 (t), . . . , X n (t), t ∈ [0, 2] is a functional sample consisting of independent stochastic processes. Since t ∈ [0, 2], we ignore the (possible) period in which the subjects are not monitored. Additionally, we suppose that the following additivity assumption holds: where i = 1, . . . , n, t ∈ [0, 2], e i (t) is random function with zero mean and covariance function C(s, t), s, t ∈ [0, 2]. By (1), the null hypothesis is given by the following formula: while the alternative hypothesis is H 1 : m(t) = m(t + 1) for some t ∈ [0, 1].

Test statistics
In classical paired t-test for sample of random variables X i = (X i,1 , X i,2 ) , i = 1, . . . , n, the test statistic is constructed based on the statistic of the form whereD n = n −1 n i=1 D i =X 1 −X 2 is the difference of the means, D i = X i,1 − X i,2 are the differences of the pairs for i = 1, . . . , n and V 2 is the sample variance of the D i 's. Recently, Konietschke and Pauly (2014) studied various bootstrap and permutation methods to approximate the null distribution of the paired t-test type statistic obtaining better testing procedures than the standard paired t-test. For paired sample problem for functional data, Martínez-Camblor and Corral (2011) and Smaga (2017) constructed tests by adapting the numerator of (3) ("between group variability") to functional data framework, i.e., they considered the following test statistic The tests based on C n are quite powerful. However, there is a question of adapting the "whole" test statistic (3) ("between and within group variabilities") to repeated functional data to obtain even more powerful testing procedure. The answer seems to be positive. The natural pointwise version of V 2 n is as follows is the unbiased estimator of the covariance function C(s, t), s, t ∈ [0, 2] and K(s, t) =Ĉ(s, t) −Ĉ(s, t + 1) −Ĉ(s + 1, t) +Ĉ(s + 1, t + 1) s, t ∈ [0, 1] is the unbiased estimator of the asymptotic covariance function of the process n 1/2 (X (t) −X (t + 1)) under the null hypothesis (2) (see the proof of Theorem 1 in Martínez-Camblor and Corral 2011). Therefore, the pointwise adaptation of the test statistic (3) to the repeated measures analysis for functional data can be of the form To obtain test statistic from (6), we can simply use the integration as in C n , namely the first new test statistic is This test statistic is similar to that of the globalizing pointwise F-test for analysis of variance problem for functional data considered by Zhang and Liang (2014). Although integration seems to be natural operation on (6), some other use of the pointwise test statistic may result in construction of more powerful test. Zhang et al. (2018) based their testing procedure on the supremum of pointwise test statistic for one-way functional analysis of variance obtaining better test than the one of Zhang and Liang (2014) in case of high and moderate correlation between observations at any two different time points. Following this idea, the second new test statistic, which we propose, is of the form The tests based on the test statistics D n and E n can be seen as adaptations of the globalizing pointwise F-test of Zhang and Liang (2014) and the F max -test of Zhang et al. (2018) to the repeated measures analysis considered, respectively. Namely, the "between group variability" n(X (t)−X (t +1)) 2 corresponds to the pointwise betweensubject variation SSR n (t), while the "within group variability"K(t, t) corresponds to the within-subject variation SSE n (t).
In fact, the test statistics D n and E n can use more information from the data than the test statistic C n , which may result in more powerful testing procedures for (2), as we will see in simulation experiments and real data example (Sects. 3 and 4).

Approximating the null distribution
To construct the tests based on the test statistics D n and E n , we use different methods to approximate their null distributions. Similarly to Martínez-Camblor and Corral (2011), we start with the use of the asymptotic null distributions of the test statistics D n and E n . They are presented in the following result. Its proof and technical assumptions are given and discussed in Appendix. Throughout the paper,  1. If Assumptions A1-A5 given in Appendix are satisfied, then as n → ∞, where z(t), t ∈ [0, 2] is a Gaussian process with zero mean and covariance function C(s, t), s, t ∈ [0, 2], λ * 1 , λ * 2 , . . . are the decreasing-ordered eigenvalues of K * (s, t) = K(s, t)/(K(s, s)K(t, t)) 1/2 , and A 1 , A 2 , . . . are the independent random variables following a central chi-squared distribution with one degree of freedom. 2. If Assumptions A1-A6 given in Appendix are satisfied, then is the same Gaussian process as in 1.
The asymptotic null distributions of the test statistics D n and E n given in (7) and (8) containing the Gaussian process z(t), t ∈ [0, 2] are not known in full, since they depend on the unknown covariance function K(s, t), s, t ∈ [0, 1]. However, they can be used to develop a simple Monte Carlo method to estimate the p-values of the tests. For this purpose, we have to generate the Gaussian processes z b (t), t ∈ [0, 2], b = 1, . . . , B with zero mean and covariance functionĈ(s, t), s, t ∈ [0, 2] given by (4), where B is a large number. The p-values of the testing procedures based on the test statistics D n and E n are then as follows

respectively, where I (A) denotes the indicator function on the set A (takes value 1 if
A is true and 0 otherwise) andK(s, t), s, t ∈ [0, 1] is given by (5). This method is similar to the methods used by Cuevas et al. (2004) and Martínez-Camblor and Corral (2011). We will refer to it as parametric bootstrap method. By (7), the asymptotic null distribution of D n is a central χ 2 -type mixture. The distribution of this mixture can be approximated by the Box-type approximation (Box 1954). Recently, Smaga (2017) used this method to approximate the asymptotic null distribution of C n . We can easily adapt his results to our purpose, since the only difference between the asymptotic null distributions of C n and D n is that the eigenvalues are of different covariance functions K(s, t) and K * (s, t), s, t ∈ [0, 1], respectively. Namely, we approximate the asymptotic null distribution of D n by that of βχ 2 d , where and , as tr(K * ) = 1. We obtain the plug-in estimatorsβ andd of β and d using the estimate (5) of the covariance function K(s, t), s, t ∈ [0, 1]. These estimators are consistent under Assumptions presented in Appendix, which can be proved similarly as Theorem 2.1 in Smaga (2017). Then, the estimated p-value is of the form The parametric bootstrap method is much more time-consuming and more liberal than the test based on Box-type approximation (see Sect. 3). The distribution of a χ 2 -type mixture can be also approximated by exact methods as, for example, the Imhof's algorithm (Imhof 1961). However, the simulation results (not shown) indicate that such algorithms work similarly to the Box-type approximation, but are usually more time-consuming. Thus, we do not consider the exact methods in the paper.
Since the tests mentioned above are constructed based on the asymptotic null distributions of the test statistics, they may need larger number of observations to perform satisfactorily, as we will see in simulations. Thus, we also investigate nonparametric approaches, namely nonparametric bootstrap and permutation procedures, which were considered by Martínez-Camblor and Corral (2011) and Konietschke and Pauly (2014). These procedures do not require the assumptions as the tests based on the asymptotic null distributions of the test statistics.
In nonparametric bootstrap approach of Martínez-Camblor and Corral (2011), we first select independent bootstrap samples X * ,k where K 1 is a large number. Then the estimated p-values of the tests based on the test statistics D n and E n are 1] are the sample mean and the estimator of covariance function K(s, t) computed on the bootstrap sample. We refer to this method as the nonparametric bootstrap method (I) or NB(I) for short. Following Konietschke and Pauly (2014), we also consider other nonparametric bootstrap method, which will be referred to as the nonparametric bootstrap method (II) or NB(II) for short. In this method, we first center the sample, i.e., X 1, Finally, the estimated p-values are of the form 1] are the sample mean and the estimator of covariance function K(s, t) computed on the bootstrap sample.
Note that the nonparametric bootstrap tests take into account the null hypothesis at the moment of computing the value of the test statistics instead of at the resampling moment, as it is suggested, for example, in Corral (2011, 2012).
On the other hand, the null hypothesis is involved at the moment to draw the permutation sample. Namely, the permutation method is based on selected independent permutation samples X ,l . . , n, and then the p-values of the permutation tests based on the test statistics D n and E n are given by whereX ,l (t), t ∈ [0, 2] andK ,l (s, t), s, t ∈ [0, 1] are the sample mean and the estimator of covariance function K(s, t) computed on the permutation sample. Simulation and real data example results in Martínez-Camblor and Corral (2011) and Smaga (2017) indicate that the all tests (namely, using parametric and nonparametric bootstrap, permutation approach and Box-type approximation) based on the test statistic C n perform very similarly under finite samples in terms of size control and power. However, this is not true for the testing procedures based on the new test statistics D n and E n as we will see in the next sections.

Simulation studies
In this section, the simulation studies are performed to investigate the behavior of the tests proposed in Sect. 2 with regard to maintaining the pre-assigned type-I error level under the null hypothesis, and the power under alternatives. Since all testing procedures based on the test statistic C n behave very similarly under finite samples, we considered only the least time-consuming BT test by Smaga (2017), as a competitor to the new tests.

Simulation setup
The paired functional samples are generated according to the formulas: The number of observations n is set to 15, 25, 35.
We consider the following three models of particular forms of mean functions: The null (resp. alternative) hypothesis holds under Model M1 (resp. Models M2 and M3). Models M1, M2 and M3 are very similar to simulation models M4, M6 and M5 of Martínez-Camblor and Corral (2011). Figure 3 of that paper shows the shapes of the above mean functions. We also considered other models of Martínez-Camblor and Corral (2011), but the results were very similar or uninteresting (e.g., the empirical powers were always equal to 100%), so they are omitted to save space. The following three settings of different error types were considered: where i = 1, . . . , n, t ∈ [0, 1], B i1 and B i2 are two independent standard Brownian Bridges, ρ = 0, 0.25, 0.5, 0.75. The error functions exp(e i j (t)), i = 1, . . . , n, j = 1, 2 are adequately centered.
The empirical sizes and powers of the tests were estimated based on 1000 simulation samples. The p-values of the bootstrap and permutation testing procedures were computed from 1000 replications. For simplicity, the nominal significance level α = 5% is chosen. The simulation experiments as well as the real data example of Sect. 4 were conducted in the R program (R Core Team 2017).

Simulation results
The empirical sizes of the tests obtained under Model M1 are given in Table 1. First of all, we observe that the parametric bootstrap test based on the test statistic E n tends to be highly liberal in all cases. The tests based on the test statistic D n and the Box-type approximation and parametric bootstrap approach are also too liberal especially for greater number of design time points (I = 101) or small number of observations (n = 15). Thus, the tests based on the asymptotic null distributions of the test statistics D n and E n may require a larger number of observations to maintain the pre-assigned type-I error rate. On the other hand, the nonparametric bootstrap testing procedures usually have conservative character, which is especially evident in case  0.25 6.2 7.5 7.9 5.1 6.7 6.2 11.8 4.6 5.1 5.8 6.2 7.5 7.2 4.6 6.4 5.9 12.5 4.8 4.9 6.1 0.50 6.4 7.5 7.9 5.5 7.5 6.1 11.2 5.9 5.2 5.6 6.0 6.9 6.9 4.5 6.9 5.5 12.2 4.3 4.2 6.1 0.75 6.7 7.5 7.8 5.9 9.1 6.4 11.4 5.9 4.7 6.3 5.9 6.1 6.4 4.6 9.0 5.5 11.8 5.2 3.2 6.1 The column "S" refers to settings of error types (N, normal; L, lognormal; M, mixed). Moreover, BT, PB, NB(I), NB(II) and P refer to Box-type approximation, parametric boostrap, nonparametric bootstrap and permutation approach, respectively of small sample size (n = 15). In most cases, the nonparametric bootstrap method (II) gives better results than the nonparametric bootstrap method (I), i.e., NB(II) is less conservative than NB(I). However, we also observe that the test based on the test statistic D n and the nonparametric bootstrap method (II) may have too liberal character under mixed setting, that is, under non-exchangeable distributions. The BT test and the permutation tests based on the test statistics D n and E n control the nominal type-I error level, but the BT test may tend to highly over-reject the null hypothesis when the number of observations is small or the distributions in the groups are different (mixed setting). Now, we investigate the empirical power of tests, which do not have too liberal character in most cases, namely the BT test and the nonparametric bootstrap and permutation tests based on the test statistics D n and E n . (Note, however, that the power of the BT test may be overstated under normal and mixed settings. The same is true for the test based on the test statistics D n and the nonparametric bootstrap method (II) under mixed setting.) The empirical power of the other tests is not comparable, as they do not maintain the type-I error rate. The empirical powers obtained under Models M2 and M3 are presented in Tables 2 and 3. In general, the empirical power increases when the number of observations or the correlation between curves also increases. The estimated statistical powers are often comparable for different numbers of design time points. However, it may happen that they increase with I , which is indicated by simulation results not shown. Unfortunately, they may also decrease with I for the tests based on the test statistic E n (especially nonparametric bootstrap method (I)) when the number of observations is small.
The empirical powers of the nonparametric bootstrap tests are usually smaller than these of the permutation procedures, which follows from conservativity of the first tests. From permutation tests, the test based on E n outperforms that based on D n , but sometimes their powers are comparable. In the group of nonparametric bootstrap methods, the NB(II) approach usually has greater empirical power, as it is less conservative than the NB(I) method. Under normal and mixed settings, the permutation test based on D n works better than the BT test, except under mixed setting and ρ = 0.75, where the empirical powers of these tests are comparable. Under the same settings, the nonparametric bootstrap tests based on E n usually have greater power than the permutation procedure based on D n . Under lognormal setting, the empirical powers of the BT test, permutation test based on D n and nonparametric bootstrap tests based on E n are comparable for n = 25, 35, but when n = 15, the last three tests perform usually better, worse than the others, respectively. The conservativity of the nonparametric bootstrap tests based on D n implies that they often have smaller power than the BT test.
To summarize, the permutation test based on E n usually outperforms the other tests, but the permutation procedure based on D n also works quite well and has larger power than the BT test in most cases. The nonparametric bootstrap tests have conservative character especially in case of small sample size, which may result in the loss of power. On the other hand, the asymptotic tests based on the test statistics D n and E n are not recommended because of their liberality.
The finite sample behavior of the tests under the null and alternative hypotheses discussed here will be seen in the real data application of the next section.  The column "S" refers to settings of error types (N, normal; L, lognormal; M, mixed). Moreover, BT, NB(I), NB(II) and P refer to Box-type approximation, nonparametric bootstrap and permutation approach, respectively Table 3 Empirical powers (as percentages) of the nonparametric bootstrap and permutation tests based on the test statistics       The column "S" refers to settings of error types (N, normal; L, lognormal; M, mixed). Moreover, BT, NB(I), NB(II) and P refer to Box-type approximation, nonparametric bootstrap and permutation approach, respectively

Real data example
In this section, we present a real data example to illustrate the practical use of the tests proposed in Sect. 2 and of the tests of Martínez-Camblor and Corral (2011) and Smaga (2017). For this purpose, we consider the human mortality data, which are often used to quantify health state of people living in different countries and to appraise biological limits of longevity (Oeppen and Vaupel 2002;Vaupel et al. 1998).
The data were obtained from the Human Mortality Database (Wilmoth et al. 2017; http://www.mortality.org).
For illustrative purposes, we take into account the mortality rates for two decades 1980-1989 and 1990-1999 for the following 32 countries: Australia, Austria, Belarus, Belgium, Bulgaria, Canada, Czech Republic, Denmark, Estonia, Finland, France, Hungary, Iceland, Ireland, Italy, Japan, Latvia, Lithuania, Luxembourg, Netherlands, New Zealand, Norway, Poland, Portugal, Russia, Slovakia, Spain, Sweden, Switzerland, United Kingdom, Ukraine, USA. We focus on the mortality rates of older individuals, i.e., in age from 60 to 100 years. The death rates for these countries were also studied by using the principal component analysis for repeated functional observations in Chen and Müller (2012), but for longer time period.
For a given decade, the observed period mortality rates can be treated as realizations of some random process being a function of age (t ∈ [60, 100]). Moreover, the observations of the mortality rates for two decades can be considered as two samples of repeated functional data, since they are measured repeatedly for different countries, which are the subjects in this example. Thus, we have n = 32 repeated functional observations measured at I = 41 design time points in each of two conditions (decades). These data as well as the corresponding sample mean functions are shown in the first row of Fig. 1.
We are interested in testing the equality of the mean curves of mortality rates in two decades, which is a repeated measures analysis for functional data. To solve it, we apply the tests proposed in Sect. 2 and those by Martínez-Camblor and Corral (2011) and Smaga (2017). The obtained p-values are presented in Table 4 (n = 32). All new testing procedures reject the null hypothesis and their p-values are almost all equal to zero. On the other hand, the p-values of the tests based on C n are slightly greater than standard significance level α = 5%, so these tests do not reject the null hypothesis and their decision is not sure. Since the tests based on D n and E n are strongly against the null hypothesis and the tests based on C n are on boundary of rejecting and accepting this hypothesis, the mean functions of mortality rates for two decades 1980-1989 and 1990-1999 seem to be unlikely the same.
The sample mean functions for mortality data indicate that the death rates decreased in decade 1990-1999 compared to decade 1980-1989 for most ages. However, for ages about 100 years, the reverse seems to be true. This may be due to the fact that for five countries, namely Belarus, Bulgaria, Lithuania, Russia and Ukraine, the mortality rates increased significantly over time, especially for people in age from 90 to 100 years (see the second row of Fig. 1). On the other hand, the other countries share a declining trend of death rates with increasing decades (see the third row of Fig. 1). This was also observed in analysis presented in Chen and Müller (2012). Moreover, most of the tests based on the test statistics C n , D n and  Fig. 1 The curves of mortality rates for two decades 1980-1989 and 1990-1999 and the all n = 32 countries considered (the first row), the n = 5 Eastern European countries (the second row), the n = 27 other countries (the third row) and the sample mean functions (solid line for decade 1980-1989; dashed line for decade 1990-1999) E n applied to these two groups of countries (the n = 5 Eastern European countries and the n = 27 other countries) reject the null hypothesis and indicate the significance of the above observations (Table 4). Note, however, that for the Eastern European countries, the tests based on the nonparametric bootstrap method (I) and on the test statistics D n and E n fail to reject the null hypothesis, which is an effect  of small sample size (n = 5) and conservativity of these tests indicated in simulations of Sect. 3. On the other hand, the tests based on the nonparametric bootstrap method (II) have much smaller p-values and reject the null hypothesis, which confirms their less conservative character than that of the first nonparametric bootstrap approach.

Conclusions
In this paper, we have studied the testing procedures for the repeated measures analysis for functional data, which are based on test statistics taking into account both "between and within group variabilities", in contrast to tests by Martínez-Camblor and Corral (2011) and Smaga (2017). By simulation studies and real data example, different proposed methods of approximating the null distribution of the test statistics do not work equally well under finite samples. The Box-type approximation and parametric bootstrap are too liberal when the number of observations is small, and therefore they are not recommended in such situation. On the other hand, the nonparametric bootstrap method may result in conservative tests and the loss of power may be noted.
In terms of size control and power, the permutation test based on the test statistic being the supremum of the pointwise test statistic seems to be the best testing procedure, which also outperforms the tests of Martínez-Camblor and Corral (2011) and Smaga (2017). We also observed that the procedures proposed in these papers may be too anticonservative for very small sample size (n ≤ 15).